This is a note to let you know that I've just added the patch titled
usb: chipidea: core: fix possible constant 0 when using IS_ERR(ci->role_switch)
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From f96c0384047257365371a8ac217107c0371586f1 Mon Sep 17 00:00:00 2001
From: Xu Yang <xu.yang_2(a)nxp.com>
Date: Thu, 15 Dec 2022 13:54:09 +0800
Subject: usb: chipidea: core: fix possible constant 0 when using
IS_ERR(ci->role_switch)
After a successful probe, ci->role_switch is either NULL or a valid
pointer, so IS_ERR(ci->role_switch) always returns 0. Test the pointer
directly instead of wrapping it in IS_ERR(); otherwise the logic is
wrong.
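For reference, a minimal sketch of the semantics involved (illustrative
only, not part of the patch; the helper name is hypothetical):

#include <linux/err.h>

/* After probe, ci->role_switch is NULL or a valid pointer, never an
 * ERR_PTR() value, so IS_ERR() is always false for it. */
static bool ci_has_role_switch(struct ci_hdrc *ci)
{
	return ci->role_switch != NULL;	/* the corrected test */
	/* !IS_ERR(ci->role_switch) would be constant true here */
}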
Fixes: e1b5d2bed67c ("usb: chipidea: core: handle usb role switch in a common way")
cc: <stable(a)vger.kernel.org>
Signed-off-by: Xu Yang <xu.yang_2(a)nxp.com>
Link: https://lore.kernel.org/r/20221215055409.3760523-1-xu.yang_2@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/chipidea/core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/usb/chipidea/core.c b/drivers/usb/chipidea/core.c
index 484b1cd23431..27c601296130 100644
--- a/drivers/usb/chipidea/core.c
+++ b/drivers/usb/chipidea/core.c
@@ -1294,12 +1294,12 @@ static void ci_extcon_wakeup_int(struct ci_hdrc *ci)
cable_id = &ci->platdata->id_extcon;
cable_vbus = &ci->platdata->vbus_extcon;
- if ((!IS_ERR(cable_id->edev) || !IS_ERR(ci->role_switch))
+ if ((!IS_ERR(cable_id->edev) || ci->role_switch)
&& ci->is_otg &&
(otgsc & OTGSC_IDIE) && (otgsc & OTGSC_IDIS))
ci_irq(ci);
- if ((!IS_ERR(cable_vbus->edev) || !IS_ERR(ci->role_switch))
+ if ((!IS_ERR(cable_vbus->edev) || ci->role_switch)
&& ci->is_otg &&
(otgsc & OTGSC_BSVIE) && (otgsc & OTGSC_BSVIS))
ci_irq(ci);
--
2.39.0
This is a note to let you know that I've just added the patch titled
USB: gadgetfs: Fix race between mounting and unmounting
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From d18dcfe9860e842f394e37ba01ca9440ab2178f4 Mon Sep 17 00:00:00 2001
From: Alan Stern <stern(a)rowland.harvard.edu>
Date: Fri, 23 Dec 2022 09:59:09 -0500
Subject: USB: gadgetfs: Fix race between mounting and unmounting
The syzbot fuzzer and Gerald Lee have identified a use-after-free bug
in the gadgetfs driver, involving processes concurrently mounting and
unmounting the gadgetfs filesystem. In particular, gadgetfs_fill_super()
can race with gadgetfs_kill_sb(), causing the latter to deallocate
the_device while the former is using it. The output from KASAN says,
in part:
BUG: KASAN: use-after-free in instrument_atomic_read_write include/linux/instrumented.h:102 [inline]
BUG: KASAN: use-after-free in atomic_fetch_sub_release include/linux/atomic/atomic-instrumented.h:176 [inline]
BUG: KASAN: use-after-free in __refcount_sub_and_test include/linux/refcount.h:272 [inline]
BUG: KASAN: use-after-free in __refcount_dec_and_test include/linux/refcount.h:315 [inline]
BUG: KASAN: use-after-free in refcount_dec_and_test include/linux/refcount.h:333 [inline]
BUG: KASAN: use-after-free in put_dev drivers/usb/gadget/legacy/inode.c:159 [inline]
BUG: KASAN: use-after-free in gadgetfs_kill_sb+0x33/0x100 drivers/usb/gadget/legacy/inode.c:2086
Write of size 4 at addr ffff8880276d7840 by task syz-executor126/18689
CPU: 0 PID: 18689 Comm: syz-executor126 Not tainted 6.1.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
<TASK>
...
atomic_fetch_sub_release include/linux/atomic/atomic-instrumented.h:176 [inline]
__refcount_sub_and_test include/linux/refcount.h:272 [inline]
__refcount_dec_and_test include/linux/refcount.h:315 [inline]
refcount_dec_and_test include/linux/refcount.h:333 [inline]
put_dev drivers/usb/gadget/legacy/inode.c:159 [inline]
gadgetfs_kill_sb+0x33/0x100 drivers/usb/gadget/legacy/inode.c:2086
deactivate_locked_super+0xa7/0xf0 fs/super.c:332
vfs_get_super fs/super.c:1190 [inline]
get_tree_single+0xd0/0x160 fs/super.c:1207
vfs_get_tree+0x88/0x270 fs/super.c:1531
vfs_fsconfig_locked fs/fsopen.c:232 [inline]
The simplest solution is to ensure that gadgetfs_fill_super() and
gadgetfs_kill_sb() are serialized by making them both acquire a new
mutex.
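In outline, the pattern is as follows (the function bodies are
illustrative outlines of the patched paths, not new code in the patch):

static DEFINE_MUTEX(sb_mutex);	/* serializes superblock setup/teardown */

static int fill_super_outline(struct super_block *sb)
{
	int rc = 0;

	mutex_lock(&sb_mutex);
	/* ... allocate dev and publish it via the_device ... */
	mutex_unlock(&sb_mutex);
	return rc;
}

static void kill_sb_outline(struct super_block *sb)
{
	mutex_lock(&sb_mutex);
	/* ... put_dev(the_device); the_device = NULL; ... */
	mutex_unlock(&sb_mutex);
}

With both paths holding sb_mutex, kill_sb can never free the_device
while fill_super is still dereferencing it.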
Signed-off-by: Alan Stern <stern(a)rowland.harvard.edu>
Reported-and-tested-by: syzbot+33d7ad66d65044b93f16(a)syzkaller.appspotmail.com
Reported-and-tested-by: Gerald Lee <sundaywind2004(a)gmail.com>
Link: https://lore.kernel.org/linux-usb/CAO3qeMVzXDP-JU6v1u5Ags6Q-bb35kg3=C6d04Dj…
Fixes: e5d82a7360d1 ("vfs: Convert gadgetfs to use the new mount API")
CC: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/Y6XCPXBpn3tmjdCC@rowland.harvard.edu
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/gadget/legacy/inode.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/drivers/usb/gadget/legacy/inode.c b/drivers/usb/gadget/legacy/inode.c
index 01c3ead7d1b4..d605bc2e7e8f 100644
--- a/drivers/usb/gadget/legacy/inode.c
+++ b/drivers/usb/gadget/legacy/inode.c
@@ -229,6 +229,7 @@ static void put_ep (struct ep_data *data)
*/
static const char *CHIP;
+static DEFINE_MUTEX(sb_mutex); /* Serialize superblock operations */
/*----------------------------------------------------------------------*/
@@ -2010,13 +2011,20 @@ gadgetfs_fill_super (struct super_block *sb, struct fs_context *fc)
{
struct inode *inode;
struct dev_data *dev;
+ int rc;
- if (the_device)
- return -ESRCH;
+ mutex_lock(&sb_mutex);
+
+ if (the_device) {
+ rc = -ESRCH;
+ goto Done;
+ }
CHIP = usb_get_gadget_udc_name();
- if (!CHIP)
- return -ENODEV;
+ if (!CHIP) {
+ rc = -ENODEV;
+ goto Done;
+ }
/* superblock */
sb->s_blocksize = PAGE_SIZE;
@@ -2053,13 +2061,17 @@ gadgetfs_fill_super (struct super_block *sb, struct fs_context *fc)
* from binding to a controller.
*/
the_device = dev;
- return 0;
+ rc = 0;
+ goto Done;
-Enomem:
+ Enomem:
kfree(CHIP);
CHIP = NULL;
+ rc = -ENOMEM;
- return -ENOMEM;
+ Done:
+ mutex_unlock(&sb_mutex);
+ return rc;
}
/* "mount -t gadgetfs path /dev/gadget" ends up here */
@@ -2081,6 +2093,7 @@ static int gadgetfs_init_fs_context(struct fs_context *fc)
static void
gadgetfs_kill_sb (struct super_block *sb)
{
+ mutex_lock(&sb_mutex);
kill_litter_super (sb);
if (the_device) {
put_dev (the_device);
@@ -2088,6 +2101,7 @@ gadgetfs_kill_sb (struct super_block *sb)
}
kfree(CHIP);
CHIP = NULL;
+ mutex_unlock(&sb_mutex);
}
/*----------------------------------------------------------------------*/
--
2.39.0
This is a note to let you know that I've just added the patch titled
usb: cdns3: remove fetched trb from cache before dequeuing
to my usb git tree which can be found at
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git
in the usb-linus branch.
The patch will show up in the next release of the linux-next tree
(usually sometime within the next 24 hours during the week.)
The patch will hopefully also be merged in Linus's tree for the
next -rc kernel release.
If you have any questions about this process, please let me know.
From 1301c7b9f7efad2f11ef924e317c18ebd714fc9a Mon Sep 17 00:00:00 2001
From: Pawel Laszczak <pawell(a)cadence.com>
Date: Tue, 15 Nov 2022 05:00:39 -0500
Subject: usb: cdns3: remove fetched trb from cache before dequeuing
After the doorbell is rung, the DMA fetches the TRB. If, while dequeuing
a request, the driver changes a NORMAL TRB into a LINK TRB but does not
remove it from the controller cache, the controller will still handle
the cached TRB and a packet can be lost.
An example scenario for this issue:
1. Queue a request - set the doorbell.
2. Dequeue the request.
3. The host sends an OUT data packet.
4. The device accepts this packet, which is unexpected.
5. Queue a new request - set the doorbell.
6. The device has lost the expected packet.
Setting DFLUSH makes the controller clear the DRDY bit and stop the DMA
transfer.
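In outline, the added stop-and-flush step looks like this (register and
field names are from the driver; readl_poll_timeout_atomic() comes from
<linux/iopoll.h> and here polls every 1 us for at most 1000 us):

	/* Stop DMA and drop the fetched TRB from the controller cache */
	writel(EP_CMD_DFLUSH, &priv_dev->regs->ep_cmd);

	/* Wait for the controller to clear DFLUSH (up to 1 ms) */
	readl_poll_timeout_atomic(&priv_dev->regs->ep_cmd, val,
				  !(val & EP_CMD_DFLUSH), 1, 1000);

	/* Only now is it safe to rewrite the NORMAL TRB as a LINK TRB */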
Fixes: 7733f6c32e36 ("usb: cdns3: Add Cadence USB3 DRD Driver")
cc: <stable(a)vger.kernel.org>
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
Acked-by: Peter Chen <peter.chen(a)kernel.org>
Link: https://lore.kernel.org/r/20221115100039.441295-1-pawell@cadence.com
Signed-off-by: Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
---
drivers/usb/cdns3/cdns3-gadget.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/usb/cdns3/cdns3-gadget.c b/drivers/usb/cdns3/cdns3-gadget.c
index 5adcb349718c..ccfaebca6faa 100644
--- a/drivers/usb/cdns3/cdns3-gadget.c
+++ b/drivers/usb/cdns3/cdns3-gadget.c
@@ -2614,6 +2614,7 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
u8 req_on_hw_ring = 0;
unsigned long flags;
int ret = 0;
+ int val;
if (!ep || !request || !ep->desc)
return -EINVAL;
@@ -2649,6 +2650,13 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
/* Update ring only if removed request is on pending_req_list list */
if (req_on_hw_ring && link_trb) {
+ /* Stop DMA */
+ writel(EP_CMD_DFLUSH, &priv_dev->regs->ep_cmd);
+
+ /* wait for DFLUSH cleared */
+ readl_poll_timeout_atomic(&priv_dev->regs->ep_cmd, val,
+ !(val & EP_CMD_DFLUSH), 1, 1000);
+
link_trb->buffer = cpu_to_le32(TRB_BUFFER(priv_ep->trb_pool_dma +
((priv_req->end_trb + 1) * TRB_SIZE)));
link_trb->control = cpu_to_le32((le32_to_cpu(link_trb->control) & TRB_CYCLE) |
@@ -2660,6 +2668,10 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
cdns3_gadget_giveback(priv_ep, priv_req, -ECONNRESET);
+ req = cdns3_next_request(&priv_ep->pending_req_list);
+ if (req)
+ cdns3_rearm_transfer(priv_ep, 1);
+
not_found:
spin_unlock_irqrestore(&priv_dev->lock, flags);
return ret;
--
2.39.0
Commit e00b488e813f ("usb-storage: Add Hiksemi USB3-FW to IGNORE_UAS")
blacklists UAS for all RTL9210 enclosures.
The RTL9210 controller has been advertised with UAS support since its
release back in 2019 and has shipped in many enclosure products with
different firmware combinations.
Blacklist UAS only for the HIKSEMI MD202.
This should hopefully be replaced with a more robust method than just
comparing strings. But with the limited information [1] provided thus
far (dmesg when the device is plugged in, which includes manufacturer
and product, but no lsusb -v to compare against), this is the best we
can do for now.
[1] https://lore.kernel.org/all/20230109115550.71688-1-qkrwngud825@gmail.com
Fixes: e00b488e813f ("usb-storage: Add Hiksemi USB3-FW to IGNORE_UAS")
Cc: Alan Stern <stern(a)rowland.harvard.edu>
Cc: Hongling Zeng <zenghongling(a)kylinos.cn>
Cc: stable(a)vger.kernel.org
Signed-off-by: Juhyung Park <qkrwngud825(a)gmail.com>
---
drivers/usb/storage/uas-detect.h | 13 +++++++++++++
drivers/usb/storage/unusual_uas.h | 7 -------
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/drivers/usb/storage/uas-detect.h b/drivers/usb/storage/uas-detect.h
index 3f720faa6f97..d73282c0ec50 100644
--- a/drivers/usb/storage/uas-detect.h
+++ b/drivers/usb/storage/uas-detect.h
@@ -116,6 +116,19 @@ static int uas_use_uas_driver(struct usb_interface *intf,
if (le16_to_cpu(udev->descriptor.idVendor) == 0x0bc2)
flags |= US_FL_NO_ATA_1X;
+ /*
+ * RTL9210-based enclosure from HIKSEMI, MD202 reportedly have issues
+ * with UAS. This isn't distinguishable with just idVendor and
+ * idProduct, use manufacturer and product too.
+ *
+ * Reported-by: Hongling Zeng <zenghongling(a)kylinos.cn>
+ */
+ if (le16_to_cpu(udev->descriptor.idVendor) == 0x0bda &&
+ le16_to_cpu(udev->descriptor.idProduct) == 0x9210 &&
+ (udev->manufacturer && !strcmp(udev->manufacturer, "HIKSEMI")) &&
+ (udev->product && !strcmp(udev->product, "MD202")))
+ flags |= US_FL_IGNORE_UAS;
+
usb_stor_adjust_quirks(udev, &flags);
if (flags & US_FL_IGNORE_UAS) {
diff --git a/drivers/usb/storage/unusual_uas.h b/drivers/usb/storage/unusual_uas.h
index 251778d14e2d..c7b763d6d102 100644
--- a/drivers/usb/storage/unusual_uas.h
+++ b/drivers/usb/storage/unusual_uas.h
@@ -83,13 +83,6 @@ UNUSUAL_DEV(0x0bc2, 0x331a, 0x0000, 0x9999,
USB_SC_DEVICE, USB_PR_DEVICE, NULL,
US_FL_NO_REPORT_LUNS),
-/* Reported-by: Hongling Zeng <zenghongling(a)kylinos.cn> */
-UNUSUAL_DEV(0x0bda, 0x9210, 0x0000, 0x9999,
- "Hiksemi",
- "External HDD",
- USB_SC_DEVICE, USB_PR_DEVICE, NULL,
- US_FL_IGNORE_UAS),
-
/* Reported-by: Benjamin Tissoires <benjamin.tissoires(a)redhat.com> */
UNUSUAL_DEV(0x13fd, 0x3940, 0x0000, 0x9999,
"Initio Corporation",
--
2.39.0
This is the start of the stable review cycle for the 5.10.164 release.
There are 64 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Wed, 18 Jan 2023 15:47:28 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.164-r…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-5.10.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 5.10.164-rc1
Ferry Toth <ftoth(a)exalondelft.nl>
Revert "usb: ulpi: defer ulpi_register on ulpi_read_id timeout"
Jens Axboe <axboe(a)kernel.dk>
io_uring/io-wq: only free worker if it was allocated for creation
Jens Axboe <axboe(a)kernel.dk>
io_uring/io-wq: free worker if task_work creation is canceled
Rob Clark <robdclark(a)chromium.org>
drm/virtio: Fix GEM handle creation UAF
Johan Hovold <johan+linaro(a)kernel.org>
efi: fix NULL-deref in init error path
Mark Rutland <mark.rutland(a)arm.com>
arm64: cmpxchg_double*: hazard against entire exchange variable
Mark Rutland <mark.rutland(a)arm.com>
arm64: atomics: remove LL/SC trampolines
Mark Rutland <mark.rutland(a)arm.com>
arm64: atomics: format whitespace consistently
Peter Newman <peternewman(a)google.com>
x86/resctrl: Fix task CLOSID/RMID update race
Reinette Chatre <reinette.chatre(a)intel.com>
x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI
Paolo Bonzini <pbonzini(a)redhat.com>
KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID
Paolo Bonzini <pbonzini(a)redhat.com>
Documentation: KVM: add API issues section
Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()
Yong Wu <yong.wu(a)mediatek.com>
iommu/mediatek-v1: Add error handle for mtk_iommu_probe
Aaron Thompson <dev(a)aaront.org>
mm: Always release pages to the buddy allocator in memblock_free_late().
Gavin Li <gavinl(a)nvidia.com>
net/mlx5e: Don't support encap rules with gbp option
Rahul Rameshbabu <rrameshbabu(a)nvidia.com>
net/mlx5: Fix ptp max frequency adjustment range
Ido Schimmel <idosch(a)nvidia.com>
net/sched: act_mpls: Fix warning during failed attribute validation
Minsuk Kang <linuxlovemin(a)yonsei.ac.kr>
nfc: pn533: Wait for out_urb's completion in pn533_usb_send_frame()
Roger Pau Monne <roger.pau(a)citrix.com>
hvc/xen: lock console list traversal
Angela Czubak <aczubak(a)marvell.com>
octeontx2-af: Fix LMAC config in cgx_lmac_rx_tx_enable
Subbaraya Sundeep <sbhatta(a)marvell.com>
octeontx2-af: Map NIX block from CGX connection
Subbaraya Sundeep <sbhatta(a)marvell.com>
octeontx2-af: Update get/set resource count functions
Tung Nguyen <tung.q.nguyen(a)dektech.com.au>
tipc: fix unexpected link reset due to discovery messages
Emanuele Ghidoli <emanuele.ghidoli(a)toradex.com>
ASoC: wm8904: fix wrong outputs volume after power reactivation
Ricardo Ribalda <ribalda(a)chromium.org>
regulator: da9211: Use irq handler when ready
Eliav Farber <farbere(a)amazon.com>
EDAC/device: Fix period calculation in edac_device_reset_delay_period()
Peter Zijlstra <peterz(a)infradead.org>
x86/boot: Avoid using Intel mnemonics in AT&T syntax asm
Kajol Jain <kjain(a)linux.ibm.com>
powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
Gavrilov Ilia <Ilia.Gavrilov(a)infotecs.ru>
netfilter: ipset: Fix overflow before widen in the bitmap_ip_create() function.
Nicolas Dichtel <nicolas.dichtel(a)6wind.com>
xfrm: fix rcu lock in xfrm_notify_userpolicy()
Ye Bin <yebin10(a)huawei.com>
ext4: fix uninititialized value in 'ext4_evict_inode'
Ferry Toth <ftoth(a)exalondelft.nl>
usb: ulpi: defer ulpi_register on ulpi_read_id timeout
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: Prevent infinite loop in transaction errors recovery for streams
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: move and rename xhci_cleanup_halted_endpoint()
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: store TD status in the td struct instead of passing it along
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: move xhci_td_cleanup so it can be called by more functions
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: Add xhci_reset_halted_ep() helper function
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: adjust parameters passed to cleanup_halted_endpoint()
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: get isochronous ring directly from endpoint structure
Mathias Nyman <mathias.nyman(a)linux.intel.com>
xhci: Avoid parsing transfer events several times
Li Jun <jun.li(a)nxp.com>
clk: imx: imx8mp: add shared clk gate for usb suspend clk
Li Jun <jun.li(a)nxp.com>
dt-bindings: clocks: imx8mp: Add ID for usb suspend clock
Lucas Stach <l.stach(a)pengutronix.de>
clk: imx8mp: add clkout1/2 support
Marek Vasut <marex(a)denx.de>
clk: imx8mp: Add DISP2 pixel clock
Kim Phillips <kim.phillips(a)amd.com>
iommu/amd: Fix ill-formed ivrs_ioapic, ivrs_hpet and ivrs_acpihid options
Suravee Suthikulpanit <suravee.suthikulpanit(a)amd.com>
iommu/amd: Add PCI segment support for ivrs_[ioapic/hpet/acpihid] commands
Qiang Yu <quic_qianyu(a)quicinc.com>
bus: mhi: host: Fix race between channel preparation and M0 event
Herbert Xu <herbert(a)gondor.apana.org.au>
ipv6: raw: Deduct extension header length in rawv6_push_pending_frames
Yang Yingliang <yangyingliang(a)huawei.com>
ixgbe: fix pci device refcount leak
Hans de Goede <hdegoede(a)redhat.com>
platform/x86: sony-laptop: Don't turn off 0x153 keyboard backlight during probe
Kuogee Hsieh <quic_khsieh(a)quicinc.com>
drm/msm/dp: do not complete dp_aux_cmd_fifo_tx() if irq is not for aux transfer
Konrad Dybcio <konrad.dybcio(a)linaro.org>
drm/msm/adreno: Make adreno quirks not overwrite each other
Volker Lendecke <vl(a)samba.org>
cifs: Fix uninitialized memory read for smb311 posix symlink create
Heiko Carstens <hca(a)linux.ibm.com>
s390/percpu: add READ_ONCE() to arch_this_cpu_to_op_simple()
Heiko Carstens <hca(a)linux.ibm.com>
s390/cpum_sf: add READ_ONCE() semantics to compare and swap loops
Brian Norris <computersforpeace(a)gmail.com>
ASoC: qcom: lpass-cpu: Fix fallback SD line index handling
Alexander Egorenkov <egorenar(a)linux.ibm.com>
s390/kexec: fix ipl report address for kdump
Adrian Hunter <adrian.hunter(a)intel.com>
perf auxtrace: Fix address filter duplicate symbol selection
Jonathan Corbet <corbet(a)lwn.net>
docs: Fix the docs build with Sphinx 6.0
Ard Biesheuvel <ardb(a)kernel.org>
efi: tpm: Avoid READ_ONCE() for accessing the event log
Marc Zyngier <maz(a)kernel.org>
KVM: arm64: Fix S1PTW handling on RO memslots
Luka Guzenko <l.guzenko(a)web.de>
ALSA: hda/realtek: Enable mute/micmute LEDs on HP Spectre x360 13-aw0xxx
Pablo Neira Ayuso <pablo(a)netfilter.org>
netfilter: nft_payload: incorrect arithmetics when fetching VLAN header bits
-------------
Diffstat:
Documentation/admin-guide/kernel-parameters.txt | 51 +++-
Documentation/sphinx/load_config.py | 6 +-
Documentation/virt/kvm/api.rst | 60 +++++
Makefile | 4 +-
arch/arm64/include/asm/atomic_ll_sc.h | 114 ++++----
arch/arm64/include/asm/atomic_lse.h | 16 +-
arch/arm64/include/asm/kvm_emulate.h | 22 +-
arch/powerpc/include/asm/imc-pmu.h | 2 +-
arch/powerpc/perf/imc-pmu.c | 136 +++++-----
arch/s390/include/asm/cpu_mf.h | 31 ++-
arch/s390/include/asm/percpu.h | 2 +-
arch/s390/kernel/machine_kexec_file.c | 5 +-
arch/s390/kernel/perf_cpum_sf.c | 101 ++++---
arch/x86/boot/bioscall.S | 4 +-
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 26 +-
arch/x86/kvm/cpuid.c | 32 +--
drivers/bus/mhi/core/pm.c | 3 +-
drivers/clk/imx/clk-imx8mp.c | 23 +-
drivers/edac/edac_device.c | 17 +-
drivers/edac/edac_module.h | 2 +-
drivers/firmware/efi/efi.c | 9 +-
drivers/gpu/drm/msm/adreno/adreno_gpu.h | 10 +-
drivers/gpu/drm/msm/dp/dp_aux.c | 4 +
drivers/gpu/drm/virtio/virtgpu_ioctl.c | 10 +-
drivers/iommu/amd/init.c | 89 ++++--
drivers/iommu/mtk_iommu_v1.c | 26 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c | 14 +-
drivers/net/ethernet/marvell/octeontx2/af/cgx.c | 17 +-
drivers/net/ethernet/marvell/octeontx2/af/cgx.h | 6 +-
drivers/net/ethernet/marvell/octeontx2/af/rvu.c | 134 ++++++++--
drivers/net/ethernet/marvell/octeontx2/af/rvu.h | 4 +
.../net/ethernet/marvell/octeontx2/af/rvu_cgx.c | 15 ++
.../net/ethernet/marvell/octeontx2/af/rvu_nix.c | 21 +-
.../ethernet/mellanox/mlx5/core/en/tc_tun_vxlan.c | 2 +
.../net/ethernet/mellanox/mlx5/core/lib/clock.c | 2 +-
drivers/nfc/pn533/usb.c | 44 ++-
drivers/platform/x86/sony-laptop.c | 21 +-
drivers/regulator/da9211-regulator.c | 11 +-
drivers/tty/hvc/hvc_xen.c | 46 ++--
drivers/usb/host/xhci-mem.c | 4 +
drivers/usb/host/xhci-ring.c | 297 +++++++++++----------
drivers/usb/host/xhci.h | 6 +-
fs/cifs/link.c | 1 +
fs/ext4/super.c | 1 +
include/dt-bindings/clock/imx8mp-clock.h | 10 +-
include/linux/tpm_eventlog.h | 4 +-
io_uring/io-wq.c | 6 +
mm/memblock.c | 8 +-
net/ipv6/raw.c | 4 +
net/netfilter/ipset/ip_set_bitmap_ip.c | 4 +-
net/netfilter/nft_payload.c | 2 +-
net/sched/act_mpls.c | 8 +-
net/tipc/node.c | 12 +-
net/xfrm/xfrm_user.c | 7 +-
sound/pci/hda/patch_realtek.c | 23 ++
sound/soc/codecs/wm8904.c | 7 +
sound/soc/qcom/lpass-cpu.c | 5 +-
tools/perf/util/auxtrace.c | 2 +-
58 files changed, 1015 insertions(+), 538 deletions(-)
Dear Linux folks,
Could you please apply commit 0c25422d34b4 (scsi: mpt3sas: Remove
scsi_dma_map() error messages) to the 5.15.y series?
commit 0c25422d34b4726b2707d5f38560943155a91b80
Author: Sreekanth Reddy <sreekanth.reddy(a)broadcom.com>
Date: Thu Mar 3 19:32:03 2022 +0530
scsi: mpt3sas: Remove scsi_dma_map() error messages
When scsi_dma_map() fails by returning a sges_left value less than
zero, the amount of logging produced can be extremely high. In a recent
end-user environment, 1200 messages per second were being sent to the
log buffer. This eventually overwhelmed the system and it stalled.
These error messages are not needed. Remove them.
Link:
https://lore.kernel.org/r/20220303140203.12642-1-sreekanth.reddy@broadcom.c…
Suggested-by: Christoph Hellwig <hch(a)lst.de>
Signed-off-by: Sreekanth Reddy <sreekanth.reddy(a)broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen(a)oracle.com>
We see this regression after upgrading from Linux 5.10 to 5.15 on our
file servers with Broadcom/LSI SAS3008 PCI-Express Fusion-MPT SAS-3
(mpt3sas) – though luckily our systems do not stall/crash.
The commit message does not say what change caused these errors to
start appearing – the log statements have been there since v4.20-rc1,
if I am not mistaken, so it must be something else – and it also does
not explain why these log messages are not needed even though the new
error condition is apparently expected.
In the Canonical/Ubuntu bug tracker I found the explanation below [2].
> 2. mpt3sas: Remove scsi_dma_map error messages:
> When the driver sets the DMA mask to 32 bit, we observe that the
> SWIOTLB bounce buffers are exhausted quickly. For most of the I/Os
> the driver observes that the scsi_dma_map() API returns a failure
> status, and hence the driver was printing the error message below.
> Since this error message is printed per I/O, if the user issues heavy
> I/O the kernel is overwhelmed with it. We also observe a kernel panic
> when the serial console is enabled. So to limit this issue, we
> removed this error message with this patch.
> "scsi_dma_map failed: request for 1310720 bytes!"
The Launchpad issue was created in March 2022, and the fixed Linux
kernel package 5.15.0-53.59 for Ubuntu 22.04 was released on
November 15th, 2022.
Sreekanth, looking again: you are the patch author, one of the Broadcom
maintainers (LSILOGIC MPT FUSION DRIVERS (FC/SAS/SPI)), and you created
the Launchpad bug report. I am surprised you didn’t get it backported
upstream.
Kind regards,
Paul
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?…
[2]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1965927
"[Ubuntu 22.04] mpt3sas: Request to include latest bug fix patches"
This patch is a fix for Linux kernel v2.6.33 or later.
For request subactions to the IEC 61883-1 FCP region, the Linux
FireWire subsystem has had a use-after-free issue. The subsystem allows
multiple user space listeners on the region, while the payload data was
likely to be released before the listeners executed read(2) to copy it
to user space.
The issue was fixed by commit 281e20323ab7 ("firewire: core: fix
use-after-free regression in FCP handler"): the payload object is
duplicated in kernel space for each listener, and when the listener
executes ioctl(2) with the FW_CDEV_IOC_SEND_RESPONSE request, the
object is supposed to be released.
However, this causes a memory leak, since the commit relies on a call
to release_request() in drivers/firewire/core-cdev.c. Contrary to that
expectation, the function is never called due to the design of
release_client_resource(): when called with a non-NULL fourth argument,
it delegates the release task to the caller, and the implementation of
ioctl_send_response() is exactly that case. It should release the
object explicitly.
This commit fixes the bug.
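For clarity, the delegation that defeats the original expectation looks
roughly like this (a simplified sketch of the core-cdev.c logic, not
the verbatim kernel code):

static int release_client_resource(struct client *client, u32 handle,
				   client_resource_release_fn_t release,
				   struct client_resource **return_resource)
{
	struct client_resource *resource;

	/* ... look up the resource by handle and unlink it ... */

	if (return_resource)
		*return_resource = resource;	/* caller now owns it ... */
	else
		resource->release(client, resource); /* ... else release here */

	return 0;
}

/* ioctl_send_response() passes a non-NULL fourth argument, so it gets
 * the resource back and must free the duplicated FCP payload itself. */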
Cc: <stable(a)vger.kernel.org>
Fixes: 281e20323ab7 ("firewire: core: fix use-after-free regression in FCP handler")
Signed-off-by: Takashi Sakamoto <o-takashi(a)sakamocchi.jp>
---
drivers/firewire/core-cdev.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
index 9c89f7d53e99..958aa4662ccb 100644
--- a/drivers/firewire/core-cdev.c
+++ b/drivers/firewire/core-cdev.c
@@ -819,8 +819,10 @@ static int ioctl_send_response(struct client *client, union ioctl_arg *arg)
r = container_of(resource, struct inbound_transaction_resource,
resource);
- if (is_fcp_request(r->request))
+ if (is_fcp_request(r->request)) {
+ kfree(r->data);
goto out;
+ }
if (a->length != fw_get_response_length(r->request)) {
ret = -EINVAL;
--
2.37.2
From: Chris Wilson <chris.p.wilson(a)linux.intel.com>
Make sure that upon error after we have acquired the wakeref we do
release it again.
Fixes: 027c38b4121e ("drm/i915/selftests: Grab the runtime pm in shrink_thp")
Reviewed-by: Matthew Auld <matthew.auld(a)intel.com>
Signed-off-by: Chris Wilson <chris.p.wilson(a)linux.intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das(a)intel.com>
Cc: <stable(a)vger.kernel.org> # v6.0+
---
drivers/gpu/drm/i915/gem/selftests/huge_pages.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
index c281b0ec9e05..295d6f2cc4ff 100644
--- a/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
+++ b/drivers/gpu/drm/i915/gem/selftests/huge_pages.c
@@ -1855,7 +1855,7 @@ static int igt_shrink_thp(void *arg)
I915_SHRINK_ACTIVE);
i915_vma_unpin(vma);
if (err)
- goto out_put;
+ goto out_wf;
/*
* Now that the pages are *unpinned* shrinking should invoke
@@ -1871,7 +1871,7 @@ static int igt_shrink_thp(void *arg)
pr_err("unexpected pages mismatch, should_swap=%s\n",
str_yes_no(should_swap));
err = -EINVAL;
- goto out_put;
+ goto out_wf;
}
if (should_swap == (obj->mm.page_sizes.sg || obj->mm.page_sizes.phys)) {
@@ -1883,7 +1883,7 @@ static int igt_shrink_thp(void *arg)
err = i915_vma_pin(vma, 0, 0, flags);
if (err)
- goto out_put;
+ goto out_wf;
while (n--) {
err = cpu_check(obj, n, 0xdeadbeaf);
--
2.39.0
If we split a bio marked with REQ_NOWAIT, then we can trigger spurious
EAGAIN if constituent parts of that split bio end up failing request
allocations. Other parts may complete just fine, but a single failure
in one of the chained bios will yield an EAGAIN final result for the
parent bio.
Return EAGAIN early if we end up needing to split such a bio, which
allows for saner recovery handling.
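For context, a hedged sketch of what a REQ_NOWAIT submitter is expected
to do with such a completion (illustrative only; the handler name is
hypothetical):

static void my_end_io(struct bio *bio)	/* hypothetical completion */
{
	if (bio->bi_status == BLK_STS_AGAIN) {
		/* The block layer could not take the bio without
		 * blocking (or, with this fix, it would have needed
		 * splitting). Retry later from a context that may
		 * sleep, without REQ_NOWAIT. */
	}
	bio_put(bio);
}

	bio->bi_opf |= REQ_NOWAIT;
	bio->bi_end_io = my_end_io;
	submit_bio(bio);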
Cc: stable(a)vger.kernel.org # 5.15+
Link: https://github.com/axboe/liburing/issues/766
Reported-by: Michael Kelley <mikelley(a)microsoft.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
---
block/blk-merge.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 071c5f8cf0cf..b7c193d67185 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -309,6 +309,16 @@ static struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
*segs = nsegs;
return NULL;
split:
+ /*
+ * We can't sanely support splitting for a REQ_NOWAIT bio. End it
+ * with EAGAIN if splitting is required and return an error pointer.
+ */
+ if (bio->bi_opf & REQ_NOWAIT) {
+ bio->bi_status = BLK_STS_AGAIN;
+ bio_endio(bio);
+ return ERR_PTR(-EAGAIN);
+ }
+
*segs = nsegs;
/*
--
2.39.0
The patch titled
Subject: mm: multi-gen LRU: fix crash during cgroup migration
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
mm-multi-gen-lru-fix-crash-during-cgroup-migration.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Yu Zhao <yuzhao(a)google.com>
Subject: mm: multi-gen LRU: fix crash during cgroup migration
Date: Sun, 15 Jan 2023 20:44:05 -0700
lru_gen_migrate_mm() assumes lru_gen_add_mm() runs prior to itself. This
isn't true for the following scenario:
  CPU 1                          CPU 2

 clone()
   cgroup_can_fork()
                                  cgroup_procs_write()
   cgroup_post_fork()
                                    task_lock()
                                    lru_gen_migrate_mm()
                                    task_unlock()
   task_lock()
   lru_gen_add_mm()
   task_unlock()
When the above happens, the kernel crashes because of linked list
corruption (mm_struct->lru_gen.list).
Link: https://lore.kernel.org/r/20230115134651.30028-1-msizanoen@qtmlabs.xyz/
Link: https://lkml.kernel.org/r/20230116034405.2960276-1-yuzhao@google.com
Fixes: bd74fdaea146 ("mm: multi-gen LRU: support page table walks")
Signed-off-by: Yu Zhao <yuzhao(a)google.com>
Reported-by: msizanoen <msizanoen(a)qtmlabs.xyz>
Tested-by: msizanoen <msizanoen(a)qtmlabs.xyz>
Cc: <stable(a)vger.kernel.org> [6.1+]
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/vmscan.c~mm-multi-gen-lru-fix-crash-during-cgroup-migration
+++ a/mm/vmscan.c
@@ -3323,13 +3323,16 @@ void lru_gen_migrate_mm(struct mm_struct
if (mem_cgroup_disabled())
return;
+ /* migration can happen before addition */
+ if (!mm->lru_gen.memcg)
+ return;
+
rcu_read_lock();
memcg = mem_cgroup_from_task(task);
rcu_read_unlock();
if (memcg == mm->lru_gen.memcg)
return;
- VM_WARN_ON_ONCE(!mm->lru_gen.memcg);
VM_WARN_ON_ONCE(list_empty(&mm->lru_gen.list));
lru_gen_del_mm(mm);
_
Patches currently in -mm which might be from yuzhao(a)google.com are
mm-multi-gen-lru-fix-crash-during-cgroup-migration.patch
mm-multi-gen-lru-rename-lru_gen_struct-to-lru_gen_folio.patch
mm-multi-gen-lru-rename-lrugen-lists-to-lrugen-folios.patch
mm-multi-gen-lru-remove-eviction-fairness-safeguard.patch
mm-multi-gen-lru-remove-aging-fairness-safeguard.patch
mm-multi-gen-lru-shuffle-should_run_aging.patch
mm-multi-gen-lru-per-node-lru_gen_folio-lists.patch
mm-multi-gen-lru-clarify-scan_control-flags.patch
mm-multi-gen-lru-simplify-arch_has_hw_pte_young-check.patch
mm-add-vma_has_recency.patch
mm-support-posix_fadv_noreuse.patch
There is a lock inversion and rwsem read-lock recursion in the devfreq
target callback which can lead to deadlocks.
Specifically, ufshcd_devfreq_scale() already holds a clk_scaling_lock
read lock when toggling the write booster, which involves taking the
dev_cmd mutex before taking another clk_scaling_lock read lock.
This can lead to a deadlock if another thread:
1) tries to acquire the dev_cmd and clk_scaling locks in the correct
order, or
2) takes a clk_scaling write lock before the attempt to take the
clk_scaling read lock a second time.
Fix this by dropping the clk_scaling_lock before toggling the write
booster as was done before commit 0e9d4ca43ba8 ("scsi: ufs: Protect some
contexts from unexpected clock scaling").
While the devfreq callbacks are already serialised, add a second
serialising mutex to handle the unlikely case where a callback triggered
through the devfreq sysfs interface is racing with a request to disable
clock scaling through the UFS controller 'clkscale_enable' sysfs
attribute. This could otherwise lead to the write booster being left
disabled after having disabled clock scaling.
Also take the new mutex in ufshcd_clk_scaling_allow() to make sure that
any pending write booster update has completed on return.
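In outline, the resulting lock ordering in the scaling path is as
follows (names as in the patch below; wb_mutex is taken strictly
outside clk_scaling_lock, and the write booster is toggled only after
the rwsem is dropped):

	mutex_lock(&hba->wb_mutex);
	down_write(&hba->clk_scaling_lock);
	/* ... perform the actual clock/gear scaling ... */
	up_write(&hba->clk_scaling_lock);

	/*
	 * Toggle the write booster only after dropping the rwsem: the
	 * toggle issues device commands, which take clk_scaling_lock
	 * for reading. wb_mutex still keeps devfreq and sysfs callers
	 * from racing here.
	 */
	if (ufshcd_enable_wb_if_scaling_up(hba))
		ufshcd_wb_toggle(hba, scale_up);

	mutex_unlock(&hba->wb_mutex);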
Note that this currently only affects Qualcomm platforms since commit
87bd05016a64 ("scsi: ufs: core: Allow host driver to disable wb toggling
during clock scaling").
The lock inversion (i.e. 1 above) was reported by lockdep as:
======================================================
WARNING: possible circular locking dependency detected
6.1.0-next-20221216 #211 Not tainted
------------------------------------------------------
kworker/u16:2/71 is trying to acquire lock:
ffff076280ba98a0 (&hba->dev_cmd.lock){+.+.}-{3:3}, at: ufshcd_query_flag+0x50/0x1c0
but task is already holding lock:
ffff076280ba9cf0 (&hba->clk_scaling_lock){++++}-{3:3}, at: ufshcd_devfreq_scale+0x2b8/0x380
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&hba->clk_scaling_lock){++++}-{3:3}:
lock_acquire+0x68/0x90
down_read+0x58/0x80
ufshcd_exec_dev_cmd+0x70/0x2c0
ufshcd_verify_dev_init+0x68/0x170
ufshcd_probe_hba+0x398/0x1180
ufshcd_async_scan+0x30/0x320
async_run_entry_fn+0x34/0x150
process_one_work+0x288/0x6c0
worker_thread+0x74/0x450
kthread+0x118/0x120
ret_from_fork+0x10/0x20
-> #0 (&hba->dev_cmd.lock){+.+.}-{3:3}:
__lock_acquire+0x12a0/0x2240
lock_acquire.part.0+0xcc/0x220
lock_acquire+0x68/0x90
__mutex_lock+0x98/0x430
mutex_lock_nested+0x2c/0x40
ufshcd_query_flag+0x50/0x1c0
ufshcd_query_flag_retry+0x64/0x100
ufshcd_wb_toggle+0x5c/0x120
ufshcd_devfreq_scale+0x2c4/0x380
ufshcd_devfreq_target+0xf4/0x230
devfreq_set_target+0x84/0x2f0
devfreq_update_target+0xc4/0xf0
devfreq_monitor+0x38/0x1f0
process_one_work+0x288/0x6c0
worker_thread+0x74/0x450
kthread+0x118/0x120
ret_from_fork+0x10/0x20
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&hba->clk_scaling_lock);
                               lock(&hba->dev_cmd.lock);
                               lock(&hba->clk_scaling_lock);
  lock(&hba->dev_cmd.lock);
*** DEADLOCK ***
Fixes: 0e9d4ca43ba8 ("scsi: ufs: Protect some contexts from unexpected clock scaling")
Cc: stable(a)vger.kernel.org # 5.12
Cc: Can Guo <quic_cang(a)quicinc.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
---
This issue has apparently been known for over a year [1] without anyone
bothering to fix it. Aside from the potential deadlocks, this also leads
to developers using Qualcomm platforms not being able to use lockdep to
prevent further issues like this from being introduced in other places.
Johan
[1] https://lore.kernel.org/lkml/1631843521-2863-1-git-send-email-cang@codeauro…
drivers/ufs/core/ufshcd.c | 29 +++++++++++++++--------------
include/ufs/ufshcd.h | 2 ++
2 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index bda61be5f035..5c3821b2fcf8 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -1234,12 +1234,14 @@ static int ufshcd_clock_scaling_prepare(struct ufs_hba *hba)
* clock scaling is in progress
*/
ufshcd_scsi_block_requests(hba);
+ mutex_lock(&hba->wb_mutex);
down_write(&hba->clk_scaling_lock);
if (!hba->clk_scaling.is_allowed ||
ufshcd_wait_for_doorbell_clr(hba, DOORBELL_CLR_TOUT_US)) {
ret = -EBUSY;
up_write(&hba->clk_scaling_lock);
+ mutex_unlock(&hba->wb_mutex);
ufshcd_scsi_unblock_requests(hba);
goto out;
}
@@ -1251,12 +1253,16 @@ static int ufshcd_clock_scaling_prepare(struct ufs_hba *hba)
return ret;
}
-static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba, bool writelock)
+static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba, bool scale_up)
{
- if (writelock)
- up_write(&hba->clk_scaling_lock);
- else
- up_read(&hba->clk_scaling_lock);
+ up_write(&hba->clk_scaling_lock);
+
+ /* Enable Write Booster if we have scaled up else disable it */
+ if (ufshcd_enable_wb_if_scaling_up(hba))
+ ufshcd_wb_toggle(hba, scale_up);
+
+ mutex_unlock(&hba->wb_mutex);
+
ufshcd_scsi_unblock_requests(hba);
ufshcd_release(hba);
}
@@ -1273,7 +1279,6 @@ static void ufshcd_clock_scaling_unprepare(struct ufs_hba *hba, bool writelock)
static int ufshcd_devfreq_scale(struct ufs_hba *hba, bool scale_up)
{
int ret = 0;
- bool is_writelock = true;
ret = ufshcd_clock_scaling_prepare(hba);
if (ret)
@@ -1302,15 +1307,8 @@ static int ufshcd_devfreq_scale(struct ufs_hba *hba, bool scale_up)
}
}
- /* Enable Write Booster if we have scaled up else disable it */
- if (ufshcd_enable_wb_if_scaling_up(hba)) {
- downgrade_write(&hba->clk_scaling_lock);
- is_writelock = false;
- ufshcd_wb_toggle(hba, scale_up);
- }
-
out_unprepare:
- ufshcd_clock_scaling_unprepare(hba, is_writelock);
+ ufshcd_clock_scaling_unprepare(hba, scale_up);
return ret;
}
@@ -6066,9 +6064,11 @@ static void ufshcd_force_error_recovery(struct ufs_hba *hba)
static void ufshcd_clk_scaling_allow(struct ufs_hba *hba, bool allow)
{
+ mutex_lock(&hba->wb_mutex);
down_write(&hba->clk_scaling_lock);
hba->clk_scaling.is_allowed = allow;
up_write(&hba->clk_scaling_lock);
+ mutex_unlock(&hba->wb_mutex);
}
static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend)
@@ -9793,6 +9793,7 @@ int ufshcd_init(struct ufs_hba *hba, void __iomem *mmio_base, unsigned int irq)
/* Initialize mutex for exception event control */
mutex_init(&hba->ee_ctrl_mutex);
+ mutex_init(&hba->wb_mutex);
init_rwsem(&hba->clk_scaling_lock);
ufshcd_init_clk_gating(hba);
diff --git a/include/ufs/ufshcd.h b/include/ufs/ufshcd.h
index 5cf81dff60aa..727084cd79be 100644
--- a/include/ufs/ufshcd.h
+++ b/include/ufs/ufshcd.h
@@ -808,6 +808,7 @@ struct ufs_hba_monitor {
* @urgent_bkops_lvl: keeps track of urgent bkops level for device
* @is_urgent_bkops_lvl_checked: keeps track if the urgent bkops level for
* device is known or not.
+ * @wb_mutex: used to serialize devfreq and sysfs write booster toggling
* @clk_scaling_lock: used to serialize device commands and clock scaling
* @desc_size: descriptor sizes reported by device
* @scsi_block_reqs_cnt: reference counting for scsi block requests
@@ -951,6 +952,7 @@ struct ufs_hba {
enum bkops_status urgent_bkops_lvl;
bool is_urgent_bkops_lvl_checked;
+ struct mutex wb_mutex;
struct rw_semaphore clk_scaling_lock;
unsigned char desc_size[QUERY_DESC_IDN_MAX];
atomic_t scsi_block_reqs_cnt;
--
2.37.4
Hi,
A patch was introduced into 6.2 that allows outputting which GPIO is
active at a given time for pinctrl-amd with dynamic debugging. This
patch is particularly useful for identifying the wakeup source when it
is reported as the GPIO controller.
I've asked at least 4 people to apply it on 6.1.y kernels and turn on
those statements in /sys/kernel/debug/dynamic_debug/control to debug
spurious-wakeup issues, which has me realizing it would make sense for
it to simply be in 6.1.y generally.
Would you please cherry-pick this commit?
1d66e379731f ("pinctrl: amd: Add dynamic debugging for active GPIOs")
Thanks,
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
544d163d659d ("io_uring: lock overflowing for IOPOLL")
a8cf95f93610 ("io_uring: fix overflow handling regression")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
52120f0fadcb ("io_uring: add allow_overflow to io_post_aux_cqe")
e6130eba8a84 ("io_uring: add support for passing fixed file descriptors")
253993210bd8 ("io_uring: introduce locking helpers for CQE posting")
305bef988708 ("io_uring: hide eventfd assumptions in eventfd paths")
d9dee4302a7c ("io_uring: remove ->flush_cqes optimisation")
a830ffd28780 ("io_uring: move io_eventfd_signal()")
9046c6415be6 ("io_uring: reshuffle io_uring/io_uring.h")
d142c3ec8d16 ("io_uring: remove extra io_commit_cqring()")
68494a65d0e2 ("io_uring: introduce io_req_cqe_overflow()")
faf88dde060f ("io_uring: don't inline __io_get_cqe()")
d245bca6375b ("io_uring: don't expose io_fill_cqe_aux()")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 544d163d659d45a206d8929370d5a2984e546cb7 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 12 Jan 2023 13:08:56 +0000
Subject: [PATCH] io_uring: lock overflowing for IOPOLL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
Reported-and-tested-by: syzbot+6805087452d72929404e(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 8227af2e1c0f..9c3ddd46a1ad 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1062,7 +1062,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
continue;
req->cqe.flags = io_put_kbuf(req, 0);
- io_fill_cqe_req(req->ctx, req);
+ if (unlikely(!__io_fill_cqe_req(ctx, req))) {
+ spin_lock(&ctx->completion_lock);
+ io_req_cqe_overflow(req);
+ spin_unlock(&ctx->completion_lock);
+ }
}
if (unlikely(!nr_events))
On Mon, Jan 16, 2023 at 06:12:12AM +0530, mkv22(a)cantab.net wrote:
> Apologies that I am unable to attach more detailed information from dmesg
> or logs as the update was in a production laptop and had to be rolled back
> to 6.1.5 immediately and lost the dmesg output.
Any chance to run 'git bisect' to find the offending change?
thanks,
greg k-h
USB3 ports on xHC hosts may have retimers that cause an exit latency
too long for the native USB3 U1/U2 link power management states to
work.
For now, only use usb_acpi_port_lpm_incapable() to evaluate whether
port LPM should be disabled while setting up the USB3 roothub.
Other ways to identify LPM-incapable ports can be added here later if
the ACPI _DSM does not exist.
Limit this to Intel hosts for now; to my knowledge this is only an
Intel issue.
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-pci.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index b5016709b26f..fb988e4ea924 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -355,8 +355,38 @@ static void xhci_pme_acpi_rtd3_enable(struct pci_dev *dev)
NULL);
ACPI_FREE(obj);
}
+
+static void xhci_find_lpm_incapable_ports(struct usb_hcd *hcd, struct usb_device *hdev)
+{
+ struct xhci_hcd *xhci = hcd_to_xhci(hcd);
+ struct xhci_hub *rhub = &xhci->usb3_rhub;
+ int ret;
+ int i;
+
+ /* This is not the usb3 roothub we are looking for */
+ if (hcd != rhub->hcd)
+ return;
+
+ if (hdev->maxchild > rhub->num_ports) {
+ dev_err(&hdev->dev, "USB3 roothub port number mismatch\n");
+ return;
+ }
+
+ for (i = 0; i < hdev->maxchild; i++) {
+ ret = usb_acpi_port_lpm_incapable(hdev, i);
+
+ dev_dbg(&hdev->dev, "port-%d disable U1/U2 _DSM: %d\n", i + 1, ret);
+
+ if (ret >= 0) {
+ rhub->ports[i]->lpm_incapable = ret;
+ continue;
+ }
+ }
+}
+
#else
static void xhci_pme_acpi_rtd3_enable(struct pci_dev *dev) { }
+static void xhci_find_lpm_incapable_ports(struct usb_hcd *hcd, struct usb_device *hdev) { }
#endif /* CONFIG_ACPI */
/* called during probe() after chip reset completes */
@@ -392,6 +422,10 @@ static int xhci_pci_setup(struct usb_hcd *hcd)
static int xhci_pci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
struct usb_tt *tt, gfp_t mem_flags)
{
+ /* Check if acpi claims some USB3 roothub ports are lpm incapable */
+ if (!hdev->parent)
+ xhci_find_lpm_incapable_ports(hcd, hdev);
+
return xhci_update_hub_device(hcd, hdev, tt, mem_flags);
}
--
2.25.1
Add a helper to evaluate the ACPI USB device specific method (_DSM)
provided in case a USB3 port shouldn't enter the U1 and U2 link states.
This _DSM was added because port-specific retimer configuration may
lead to exit latencies growing beyond the U1/U2 exit limits, and the OS
needs a way to find which ports can't support the U1/U2 link power
management states.
This _DSM is also used by Windows:
Link: https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/usb-devic…
Some patch issues found in testing were resolved by Ron Lee.
Cc: stable(a)vger.kernel.org
Tested-by: Ron Lee <ron.lee(a)intel.com>
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/core/usb-acpi.c | 65 +++++++++++++++++++++++++++++++++++++
include/linux/usb.h | 3 ++
2 files changed, 68 insertions(+)
diff --git a/drivers/usb/core/usb-acpi.c b/drivers/usb/core/usb-acpi.c
index 6d93428432f1..533baa85083c 100644
--- a/drivers/usb/core/usb-acpi.c
+++ b/drivers/usb/core/usb-acpi.c
@@ -37,6 +37,71 @@ bool usb_acpi_power_manageable(struct usb_device *hdev, int index)
}
EXPORT_SYMBOL_GPL(usb_acpi_power_manageable);
+#define UUID_USB_CONTROLLER_DSM "ce2ee385-00e6-48cb-9f05-2edb927c4899"
+#define USB_DSM_DISABLE_U1_U2_FOR_PORT 5
+
+/**
+ * usb_acpi_port_lpm_incapable - check if lpm should be disabled for a port.
+ * @hdev: USB device belonging to the usb hub
+ * @index: zero based port index
+ *
+ * Some USB3 ports may not support USB3 link power management U1/U2 states
+ * due to different retimer setup. ACPI provides _DSM method which returns 0x01
+ * if U1 and U2 states should be disabled. Evaluate _DSM with:
+ * Arg0: UUID = ce2ee385-00e6-48cb-9f05-2edb927c4899
+ * Arg1: Revision ID = 0
+ * Arg2: Function Index = 5
+ * Arg3: (empty)
+ *
+ * Return 1 if USB3 port is LPM incapable, negative on error, otherwise 0
+ */
+
+int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index)
+{
+ union acpi_object *obj;
+ acpi_handle port_handle;
+ int port1 = index + 1;
+ guid_t guid;
+ int ret;
+
+ ret = guid_parse(UUID_USB_CONTROLLER_DSM, &guid);
+ if (ret)
+ return ret;
+
+ port_handle = usb_get_hub_port_acpi_handle(hdev, port1);
+ if (!port_handle) {
+ dev_dbg(&hdev->dev, "port-%d no acpi handle\n", port1);
+ return -ENODEV;
+ }
+
+ if (!acpi_check_dsm(port_handle, &guid, 0,
+ BIT(USB_DSM_DISABLE_U1_U2_FOR_PORT))) {
+ dev_dbg(&hdev->dev, "port-%d no _DSM function %d\n",
+ port1, USB_DSM_DISABLE_U1_U2_FOR_PORT);
+ return -ENODEV;
+ }
+
+ obj = acpi_evaluate_dsm(port_handle, &guid, 0,
+ USB_DSM_DISABLE_U1_U2_FOR_PORT, NULL);
+
+ if (!obj)
+ return -ENODEV;
+
+ if (obj->type != ACPI_TYPE_INTEGER) {
+ dev_dbg(&hdev->dev, "evaluate port-%d _DSM failed\n", port1);
+ ACPI_FREE(obj);
+ return -EINVAL;
+ }
+
+ if (obj->integer.value == 0x01)
+ ret = 1;
+
+ ACPI_FREE(obj);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(usb_acpi_port_lpm_incapable);
+
/**
* usb_acpi_set_power_state - control usb port's power via acpi power
* resource
diff --git a/include/linux/usb.h b/include/linux/usb.h
index 7d5325d47c45..04a7e94fb772 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -774,11 +774,14 @@ extern struct device *usb_intf_get_dma_device(struct usb_interface *intf);
extern int usb_acpi_set_power_state(struct usb_device *hdev, int index,
bool enable);
extern bool usb_acpi_power_manageable(struct usb_device *hdev, int index);
+extern int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index);
#else
static inline int usb_acpi_set_power_state(struct usb_device *hdev, int index,
bool enable) { return 0; }
static inline bool usb_acpi_power_manageable(struct usb_device *hdev, int index)
{ return true; }
+static inline int usb_acpi_port_lpm_incapable(struct usb_device *hdev, int index)
+ { return 0; }
#endif
/* USB autosuspend and autoresume */
--
2.25.1
One USB3 roothub port may support link power management while another
root port on the same xHC can't, due to different retimers used for the
ports.
This is the case with Intel Alder Lake, and possibly future platforms,
where retimers used for USB4 ports cause too long an exit latency to
enable the native USB3 LPM U1 and U2 states.
Add a flag in the xhci port structure to indicate whether the port is
lpm_incapable, and check it while calculating the exit latency.
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci.c | 8 ++++++++
drivers/usb/host/xhci.h | 1 +
2 files changed, 9 insertions(+)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 89f92fc78bb1..2b280beb0011 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -5049,6 +5049,7 @@ static int xhci_enable_usb3_lpm_timeout(struct usb_hcd *hcd,
struct usb_device *udev, enum usb3_link_state state)
{
struct xhci_hcd *xhci;
+ struct xhci_port *port;
u16 hub_encoded_timeout;
int mel;
int ret;
@@ -5065,6 +5066,13 @@ static int xhci_enable_usb3_lpm_timeout(struct usb_hcd *hcd,
if (xhci_check_tier_policy(xhci, udev, state) < 0)
return USB3_LPM_DISABLED;
+ /* If connected to root port then check port can handle lpm */
+ if (udev->parent && !udev->parent->parent) {
+ port = xhci->usb3_rhub.ports[udev->portnum - 1];
+ if (port->lpm_incapable)
+ return USB3_LPM_DISABLED;
+ }
+
hub_encoded_timeout = xhci_calculate_lpm_timeout(hcd, udev, state);
mel = calculate_max_exit_latency(udev, state, hub_encoded_timeout);
if (mel < 0) {
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 3edfacb93817..dcee7f3207ad 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1735,6 +1735,7 @@ struct xhci_port {
int hcd_portnum;
struct xhci_hub *rhub;
struct xhci_port_cap *port_cap;
+ unsigned int lpm_incapable:1;
};
struct xhci_hub {
--
2.25.1
Allow PCI hosts to check and tune roothub and port settings before the
hub is up and running.
This override is needed to turn off U1 and U2 LPM for some ports based
on a per-port ACPI _DSM, _UPC, or possibly vendor-specific MMIO values
on Intel xHC hosts.
The USB core calls the host's update_hub_device once it creates a hub.
Entering the U1 or U2 link power save state on ports with this
limitation will cause the link to fail, rendering the USB device
unusable in that setup.
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci-pci.c | 9 +++++++++
drivers/usb/host/xhci.c | 5 ++++-
drivers/usb/host/xhci.h | 4 ++++
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/drivers/usb/host/xhci-pci.c b/drivers/usb/host/xhci-pci.c
index 2c0d7038f040..b5016709b26f 100644
--- a/drivers/usb/host/xhci-pci.c
+++ b/drivers/usb/host/xhci-pci.c
@@ -78,9 +78,12 @@ static const char hcd_name[] = "xhci_hcd";
static struct hc_driver __read_mostly xhci_pci_hc_driver;
static int xhci_pci_setup(struct usb_hcd *hcd);
+static int xhci_pci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags);
static const struct xhci_driver_overrides xhci_pci_overrides __initconst = {
.reset = xhci_pci_setup,
+ .update_hub_device = xhci_pci_update_hub_device,
};
/* called after powerup, by probe or system-pm "wakeup" */
@@ -386,6 +389,12 @@ static int xhci_pci_setup(struct usb_hcd *hcd)
return xhci_pci_reinit(xhci, pdev);
}
+static int xhci_pci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags)
+{
+ return xhci_update_hub_device(hcd, hdev, tt, mem_flags);
+}
+
/*
* We need to register our own PCI probe function (instead of the USB core's
* function) in order to create a second roothub under xHCI.
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 50b41213e827..89f92fc78bb1 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -5124,7 +5124,7 @@ static int xhci_disable_usb3_lpm_timeout(struct usb_hcd *hcd,
/* Once a hub descriptor is fetched for a device, we need to update the xHC's
* internal data structures for the device.
*/
-static int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
struct usb_tt *tt, gfp_t mem_flags)
{
struct xhci_hcd *xhci = hcd_to_xhci(hcd);
@@ -5224,6 +5224,7 @@ static int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
xhci_free_command(xhci, config_cmd);
return ret;
}
+EXPORT_SYMBOL_GPL(xhci_update_hub_device);
static int xhci_get_frame(struct usb_hcd *hcd)
{
@@ -5507,6 +5508,8 @@ void xhci_init_driver(struct hc_driver *drv,
drv->check_bandwidth = over->check_bandwidth;
if (over->reset_bandwidth)
drv->reset_bandwidth = over->reset_bandwidth;
+ if (over->update_hub_device)
+ drv->update_hub_device = over->update_hub_device;
}
}
EXPORT_SYMBOL_GPL(xhci_init_driver);
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index c9f06c5e4e9d..3edfacb93817 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1943,6 +1943,8 @@ struct xhci_driver_overrides {
struct usb_host_endpoint *ep);
int (*check_bandwidth)(struct usb_hcd *, struct usb_device *);
void (*reset_bandwidth)(struct usb_hcd *, struct usb_device *);
+ int (*update_hub_device)(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags);
};
#define XHCI_CFC_DELAY 10
@@ -2122,6 +2124,8 @@ int xhci_drop_endpoint(struct usb_hcd *hcd, struct usb_device *udev,
struct usb_host_endpoint *ep);
int xhci_check_bandwidth(struct usb_hcd *hcd, struct usb_device *udev);
void xhci_reset_bandwidth(struct usb_hcd *hcd, struct usb_device *udev);
+int xhci_update_hub_device(struct usb_hcd *hcd, struct usb_device *hdev,
+ struct usb_tt *tt, gfp_t mem_flags);
int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id);
int xhci_ext_cap_init(struct xhci_hcd *xhci);
--
2.25.1
Make sure xhci_free_dev() and xhci_kill_endpoint_urbs() do not race
and cause a null pointer dereference when the host suddenly dies.
The USB core may call xhci_free_dev(), which frees the xhci->devs[slot_id]
virt device, at the same time that xhci_kill_endpoint_urbs() tries to
loop through all the device's endpoints, checking if there are any
cancelled URBs left to give back.
Hold the xhci spinlock while freeing the virt device.
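For illustration, the locking pattern being applied looks roughly like the
sketch below (simplified; it assumes, as described above, that the URB-kill
path runs under the same xhci->lock):

	unsigned long flags;

	spin_lock_irqsave(&xhci->lock, flags);
	/* Freeing xhci->devs[slot_id] under xhci->lock means a concurrent
	 * xhci_kill_endpoint_urbs() can no longer walk a half-freed
	 * virt device's endpoints. */
	xhci_free_virt_device(xhci, udev->slot_id);
	spin_unlock_irqrestore(&xhci->lock, flags);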
Cc: stable(a)vger.kernel.org
Signed-off-by: Mathias Nyman <mathias.nyman(a)linux.intel.com>
---
drivers/usb/host/xhci.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 79d7931c048a..50b41213e827 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -3974,6 +3974,7 @@ static void xhci_free_dev(struct usb_hcd *hcd, struct usb_device *udev)
struct xhci_hcd *xhci = hcd_to_xhci(hcd);
struct xhci_virt_device *virt_dev;
struct xhci_slot_ctx *slot_ctx;
+ unsigned long flags;
int i, ret;
/*
@@ -4000,7 +4001,11 @@ static void xhci_free_dev(struct usb_hcd *hcd, struct usb_device *udev)
virt_dev->eps[i].ep_state &= ~EP_STOP_CMD_PENDING;
virt_dev->udev = NULL;
xhci_disable_slot(xhci, udev->slot_id);
+
+ spin_lock_irqsave(&xhci->lock, flags);
xhci_free_virt_device(xhci, udev->slot_id);
+ spin_unlock_irqrestore(&xhci->lock, flags);
+
}
int xhci_disable_slot(struct xhci_hcd *xhci, u32 slot_id)
--
2.25.1
Set the framebuffer info for drivers that support VGA switcheroo. Only
affects the amdgpu and nouveau drivers, which use VGA switcheroo and
generic fbdev emulation. For other drivers, this does nothing.
This fixes a potential regression in the console code. Both amdgpu and
nouveau invoked vga_switcheroo_client_fb_set() from their internal fbdev
code, but the call got lost when the drivers switched to the generic
emulation.
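As a sketch, the restored call in the generic helper is guarded so that
non-PCI devices are unaffected (matching the diff below):

	/* Only PCI devices can be vga_switcheroo clients; for everything
	 * else this branch is never taken and behavior is unchanged. */
	if (dev_is_pci(dev->dev))
		vga_switcheroo_client_fb_set(to_pci_dev(dev->dev),
					     fb_helper->info);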
Fixes: 087451f372bf ("drm/amdgpu: use generic fb helpers instead of setting up AMD own's.")
Fixes: 4a16dd9d18a0 ("drm/nouveau/kms: switch to drm fbdev helpers")
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Reviewed-by: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Reviewed-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Ben Skeggs <bskeggs(a)redhat.com>
Cc: Karol Herbst <kherbst(a)redhat.com>
Cc: Lyude Paul <lyude(a)redhat.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Javier Martinez Canillas <javierm(a)redhat.com>
Cc: Laurent Pinchart <laurent.pinchart(a)ideasonboard.com>
Cc: Jani Nikula <jani.nikula(a)intel.com>
Cc: Dave Airlie <airlied(a)redhat.com>
Cc: Evan Quan <evan.quan(a)amd.com>
Cc: Christian König <christian.koenig(a)amd.com>
Cc: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Hawking Zhang <Hawking.Zhang(a)amd.com>
Cc: Likun Gao <Likun.Gao(a)amd.com>
Cc: "Christian König" <christian.koenig(a)amd.com>
Cc: Stanley Yang <Stanley.Yang(a)amd.com>
Cc: "Tianci.Yin" <tianci.yin(a)amd.com>
Cc: Xiaojian Du <Xiaojian.Du(a)amd.com>
Cc: Andrey Grodzovsky <andrey.grodzovsky(a)amd.com>
Cc: YiPeng Chai <YiPeng.Chai(a)amd.com>
Cc: Somalapuram Amaranath <Amaranath.Somalapuram(a)amd.com>
Cc: Bokun Zhang <Bokun.Zhang(a)amd.com>
Cc: Guchun Chen <guchun.chen(a)amd.com>
Cc: Hamza Mahfooz <hamza.mahfooz(a)amd.com>
Cc: Aurabindo Pillai <aurabindo.pillai(a)amd.com>
Cc: Mario Limonciello <mario.limonciello(a)amd.com>
Cc: Solomon Chiu <solomon.chiu(a)amd.com>
Cc: Kai-Heng Feng <kai.heng.feng(a)canonical.com>
Cc: Felix Kuehling <Felix.Kuehling(a)amd.com>
Cc: Daniel Vetter <daniel.vetter(a)ffwll.ch>
Cc: "Marek Olšák" <marek.olsak(a)amd.com>
Cc: Sam Ravnborg <sam(a)ravnborg.org>
Cc: Hans de Goede <hdegoede(a)redhat.com>
Cc: "Ville Syrjälä" <ville.syrjala(a)linux.intel.com>
Cc: dri-devel(a)lists.freedesktop.org
Cc: nouveau(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v5.17+
---
drivers/gpu/drm/drm_fb_helper.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/gpu/drm/drm_fb_helper.c b/drivers/gpu/drm/drm_fb_helper.c
index 367fb8b2d5fa..c5c13e192b64 100644
--- a/drivers/gpu/drm/drm_fb_helper.c
+++ b/drivers/gpu/drm/drm_fb_helper.c
@@ -30,7 +30,9 @@
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/console.h>
+#include <linux/pci.h>
#include <linux/sysrq.h>
+#include <linux/vga_switcheroo.h>
#include <drm/drm_atomic.h>
#include <drm/drm_drv.h>
@@ -1924,6 +1926,7 @@ static int drm_fb_helper_single_fb_probe(struct drm_fb_helper *fb_helper,
int preferred_bpp)
{
struct drm_client_dev *client = &fb_helper->client;
+ struct drm_device *dev = fb_helper->dev;
struct drm_fb_helper_surface_size sizes;
int ret;
@@ -1945,6 +1948,11 @@ static int drm_fb_helper_single_fb_probe(struct drm_fb_helper *fb_helper,
return ret;
strcpy(fb_helper->fb->comm, "[fbcon]");
+
+ /* Set the fb info for vgaswitcheroo clients. Does nothing otherwise. */
+ if (dev_is_pci(dev->dev))
+ vga_switcheroo_client_fb_set(to_pci_dev(dev->dev), fb_helper->info);
+
return 0;
}
--
2.39.0
Always allow switching away via vga-switcheroo if the display is
uninitialized. Instead, prevent switching to i915 if the device has
not been initialized.
This issue was introduced by commit 5df7bd130818 ("drm/i915: skip
display initialization when there is no display"), which protects
code paths from being executed on uninitialized devices.
In the case of vga-switcheroo, we want to allow a switch away from
the i915 device. So always run vga_switcheroo_process_delayed_switch()
and test in the switcheroo callbacks whether the i915 device is available.
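The resulting callback logic can be sketched as follows (simplified from
the patch below; pdev_to_i915() is assumed to be the driver's
pci_dev-to-private helper):

	static bool i915_switcheroo_can_switch(struct pci_dev *pdev)
	{
		struct drm_i915_private *i915 = pdev_to_i915(pdev);

		/* Refuse to switch *to* i915 when the display was never
		 * initialized; switching *away* is still processed via
		 * vga_switcheroo_process_delayed_switch() in lastclose. */
		return i915 && HAS_DISPLAY(i915) &&
		       atomic_read(&i915->drm.open_count) == 0;
	}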
Fixes: 5df7bd130818 ("drm/i915: skip display initialization when there is no display")
Signed-off-by: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Radhakrishna Sripada <radhakrishna.sripada(a)intel.com>
Cc: Lucas De Marchi <lucas.demarchi(a)intel.com>
Cc: José Roberto de Souza <jose.souza(a)intel.com>
Cc: Jani Nikula <jani.nikula(a)intel.com>
Cc: Ville Syrjälä <ville.syrjala(a)linux.intel.com>
Cc: Jani Nikula <jani.nikula(a)linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen(a)linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi(a)intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin(a)linux.intel.com>
Cc: "Ville Syrjälä" <ville.syrjala(a)linux.intel.com>
Cc: Manasi Navare <manasi.d.navare(a)intel.com>
Cc: Stanislav Lisovskiy <stanislav.lisovskiy(a)intel.com>
Cc: Imre Deak <imre.deak(a)intel.com>
Cc: "Jouni Högander" <jouni.hogander(a)intel.com>
Cc: Uma Shankar <uma.shankar(a)intel.com>
Cc: Ankit Nautiyal <ankit.k.nautiyal(a)intel.com>
Cc: "Jason A. Donenfeld" <Jason(a)zx2c4.com>
Cc: Matt Roper <matthew.d.roper(a)intel.com>
Cc: Ramalingam C <ramalingam.c(a)intel.com>
Cc: Thomas Zimmermann <tzimmermann(a)suse.de>
Cc: Andi Shyti <andi.shyti(a)linux.intel.com>
Cc: Andrzej Hajda <andrzej.hajda(a)intel.com>
Cc: "José Roberto de Souza" <jose.souza(a)intel.com>
Cc: Julia Lawall <Julia.Lawall(a)inria.fr>
Cc: intel-gfx(a)lists.freedesktop.org
Cc: <stable(a)vger.kernel.org> # v5.14+
---
drivers/gpu/drm/i915/i915_driver.c | 3 +--
drivers/gpu/drm/i915/i915_switcheroo.c | 6 +++++-
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_driver.c b/drivers/gpu/drm/i915/i915_driver.c
index c1e427ba57ae..33e231b120c1 100644
--- a/drivers/gpu/drm/i915/i915_driver.c
+++ b/drivers/gpu/drm/i915/i915_driver.c
@@ -1075,8 +1075,7 @@ static void i915_driver_lastclose(struct drm_device *dev)
intel_fbdev_restore_mode(dev);
- if (HAS_DISPLAY(i915))
- vga_switcheroo_process_delayed_switch();
+ vga_switcheroo_process_delayed_switch();
}
static void i915_driver_postclose(struct drm_device *dev, struct drm_file *file)
diff --git a/drivers/gpu/drm/i915/i915_switcheroo.c b/drivers/gpu/drm/i915/i915_switcheroo.c
index 23777d500cdf..f45bd6b6cede 100644
--- a/drivers/gpu/drm/i915/i915_switcheroo.c
+++ b/drivers/gpu/drm/i915/i915_switcheroo.c
@@ -19,6 +19,10 @@ static void i915_switcheroo_set_state(struct pci_dev *pdev,
dev_err(&pdev->dev, "DRM not initialized, aborting switch.\n");
return;
}
+ if (!HAS_DISPLAY(i915)) {
+ dev_err(&pdev->dev, "Device state not initialized, aborting switch.\n");
+ return;
+ }
if (state == VGA_SWITCHEROO_ON) {
drm_info(&i915->drm, "switched on\n");
@@ -44,7 +48,7 @@ static bool i915_switcheroo_can_switch(struct pci_dev *pdev)
* locking inversion with the driver load path. And the access here is
* completely racy anyway. So don't bother with locking for now.
*/
- return i915 && atomic_read(&i915->drm.open_count) == 0;
+ return i915 && HAS_DISPLAY(i915) && atomic_read(&i915->drm.open_count) == 0;
}
static const struct vga_switcheroo_client_ops i915_switcheroo_ops = {
--
2.39.0
The introduction of support for Apple board types inadvertently changed
the precedence order, causing hybrid SMBIOS+DT platforms to look up the
firmware using the DMI information instead of the device tree compatible
to generate the board type. Revert to the old behavior, as affected
platforms use firmware images named after the DT compatible.
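The intended precedence can be sketched as follows (simplified from the
diff below): use an explicit "brcm,board-type" property when present, and
fall back to the root compatible only when it is absent:

	err = of_property_read_string(np, "brcm,board-type", &prop);
	if (!err)
		settings->board_type = prop;

	/* Fall back to the first machine compatible string only when the
	 * explicit property was missing, ignoring any DMI-derived value. */
	root = of_find_node_by_path("/");
	if (root && err) {
		/* derive settings->board_type from the compatible string */
	}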
Fixes: 7682de8b3351 ("wifi: brcmfmac: of: Fetch Apple properties")
[1] https://bugzilla.opensuse.org/show_bug.cgi?id=1206697#c13
Cc: stable(a)vger.kernel.org
Signed-off-by: Ivan T. Ivanov <iivanov(a)suse.de>
Reviewed-by: Hector Martin <marcan(a)marcan.st>
---
Changes since v1
Rewrote the commit message according to feedback.
https://lore.kernel.org/all/20230106072746.29516-1-iivanov@suse.de/
drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c
index a83699de01ec..fdd0c9abc1a1 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/of.c
@@ -79,7 +79,8 @@ void brcmf_of_probe(struct device *dev, enum brcmf_bus_type bus_type,
/* Apple ARM64 platforms have their own idea of board type, passed in
* via the device tree. They also have an antenna SKU parameter
*/
- if (!of_property_read_string(np, "brcm,board-type", &prop))
+ err = of_property_read_string(np, "brcm,board-type", &prop);
+ if (!err)
settings->board_type = prop;
if (!of_property_read_string(np, "apple,antenna-sku", &prop))
@@ -87,7 +88,7 @@ void brcmf_of_probe(struct device *dev, enum brcmf_bus_type bus_type,
/* Set board-type to the first string of the machine compatible prop */
root = of_find_node_by_path("/");
- if (root && !settings->board_type) {
+ if (root && err) {
char *board_type;
const char *tmp;
--
2.35.3
This bug is marked as fixed by the following commits:
net: core: netlink: add helper refcount dec and lock function
net: sched: add helper function to take reference to Qdisc
net: sched: extend Qdisc with rcu
net: sched: rename qdisc_destroy() to qdisc_put()
net: sched: use Qdisc rcu API instead of relying on rtnl lock
But I can't find it in the tested trees[1] for more than 90 days.
Are these the correct commits? If not, please update by replying:
#syz fix: exact-commit-title
Until then the bug is still considered open and new crashes with
the same signature are ignored.
Kernel: Linux 4.19
Dashboard link: https://syzkaller.appspot.com/bug?extid=5f229e48cccc804062c0
---
[1] I expect the commit to be present in:
1. linux-4.19.y branch of
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
After the doorbell is set, the DMA fetches the TRB. If, while dequeuing
a request, the driver changes a NORMAL TRB into a LINK TRB but doesn't
remove it from the controller cache, the controller will handle the
cached TRB and a packet can be lost.
An example scenario for this issue looks like:
1. Queue a request - set the doorbell.
2. Dequeue the request.
3. Host sends an OUT data packet.
4. Device accepts this packet, which is unexpected.
5. Queue a new request - set the doorbell.
6. Device has lost the expected packet.
Setting DFLUSH makes the controller clear the DRDY bit and stop the DMA
transfer.
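For reference, the flush-and-wait pattern added by the patch looks like
this sketch (register names as in the cdns3 driver; 1 us poll interval,
1 ms timeout):

	int val;

	/* Stop DMA so the cached TRB is discarded by the controller. */
	writel(EP_CMD_DFLUSH, &priv_dev->regs->ep_cmd);

	/* The controller clears DFLUSH once the flush has completed. */
	readl_poll_timeout_atomic(&priv_dev->regs->ep_cmd, val,
				  !(val & EP_CMD_DFLUSH), 1, 1000);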
Fixes: 7733f6c32e36 ("usb: cdns3: Add Cadence USB3 DRD Driver")
cc: <stable(a)vger.kernel.org>
Signed-off-by: Pawel Laszczak <pawell(a)cadence.com>
---
drivers/usb/cdns3/cdns3-gadget.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/drivers/usb/cdns3/cdns3-gadget.c b/drivers/usb/cdns3/cdns3-gadget.c
index 5adcb349718c..ccfaebca6faa 100644
--- a/drivers/usb/cdns3/cdns3-gadget.c
+++ b/drivers/usb/cdns3/cdns3-gadget.c
@@ -2614,6 +2614,7 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
u8 req_on_hw_ring = 0;
unsigned long flags;
int ret = 0;
+ int val;
if (!ep || !request || !ep->desc)
return -EINVAL;
@@ -2649,6 +2650,13 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
/* Update ring only if removed request is on pending_req_list list */
if (req_on_hw_ring && link_trb) {
+ /* Stop DMA */
+ writel(EP_CMD_DFLUSH, &priv_dev->regs->ep_cmd);
+
+ /* wait for DFLUSH cleared */
+ readl_poll_timeout_atomic(&priv_dev->regs->ep_cmd, val,
+ !(val & EP_CMD_DFLUSH), 1, 1000);
+
link_trb->buffer = cpu_to_le32(TRB_BUFFER(priv_ep->trb_pool_dma +
((priv_req->end_trb + 1) * TRB_SIZE)));
link_trb->control = cpu_to_le32((le32_to_cpu(link_trb->control) & TRB_CYCLE) |
@@ -2660,6 +2668,10 @@ int cdns3_gadget_ep_dequeue(struct usb_ep *ep,
cdns3_gadget_giveback(priv_ep, priv_req, -ECONNRESET);
+ req = cdns3_next_request(&priv_ep->pending_req_list);
+ if (req)
+ cdns3_rearm_transfer(priv_ep, 1);
+
not_found:
spin_unlock_irqrestore(&priv_dev->lock, flags);
return ret;
--
2.25.1
The patch titled
Subject: Revert "mm/compaction: fix set skip in fast_find_migrateblock"
has been added to the -mm mm-hotfixes-unstable branch. Its filename is
revert-mm-compaction-fix-set-skip-in-fast_find_migrateblock.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patche…
This patch will later appear in the mm-hotfixes-unstable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there every 2-3 working days
------------------------------------------------------
From: Vlastimil Babka <vbabka(a)suse.cz>
Subject: Revert "mm/compaction: fix set skip in fast_find_migrateblock"
Date: Fri, 13 Jan 2023 18:33:45 +0100
This reverts commit 7efc3b7261030da79001c00d92bc3392fd6c664c.
We have received openSUSE reports (Link 1) for the 6.1 kernel with khugepaged
stalling a CPU for long periods of time. Investigation of tracepoint data
shows that compaction is stuck repeating fast_find_migrateblock()-based
migrate page isolation, and then fails to migrate all isolated pages.
Commit 7efc3b726103 ("mm/compaction: fix set skip in
fast_find_migrateblock") was suspected as it was merged in 6.1 and in
theory can indeed remove a termination condition for
fast_find_migrateblock() under certain conditions, as it removes a place
that always marked a scanned pageblock to prevent it from being re-scanned.
There are other such places, but those can be skipped under certain
conditions, which seems to match the tracepoint data.
Testing of the revert also appears to have resolved the issue, so revert the
commit until a more robust solution for the original problem is developed.
It's also likely this will fix the qemu stalls with the 6.1 kernel reported
in Link 2, but that is not yet confirmed.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lore.kernel.org/kvm/b8017e09-f336-3035-8344-c549086c2340@kernel.org/
Link: https://lkml.kernel.org/r/20230113173345.9692-1-vbabka@suse.cz
Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Cc: Chuyi Zhou <zhouchuyi(a)bytedance.com>
Cc: Jiri Slaby <jirislaby(a)kernel.org>
Cc: Maxim Levitsky <mlevitsk(a)redhat.com>
Cc: Mel Gorman <mgorman(a)techsingularity.net>
Cc: Michal Hocko <mhocko(a)kernel.org>
Cc: Paolo Bonzini <pbonzini(a)redhat.com>
Cc: Thorsten Leemhuis <regressions(a)leemhuis.info>
Cc: <stable(a)vger.kernel.org>
Signed-off-by: Andrew Morton <akpm(a)linux-foundation.org>
---
--- a/mm/compaction.c~revert-mm-compaction-fix-set-skip-in-fast_find_migrateblock
+++ a/mm/compaction.c
@@ -1839,6 +1839,7 @@ static unsigned long fast_find_migratebl
pfn = cc->zone->zone_start_pfn;
cc->fast_search_fail = 0;
found_block = true;
+ set_pageblock_skip(freepage);
break;
}
}
_
Patches currently in -mm which might be from vbabka(a)suse.cz are
revert-mm-compaction-fix-set-skip-in-fast_find_migrateblock.patch
This is the start of the stable review cycle for the 6.1.6 release.
There are 10 patches in this series, all will be posted as a response
to this one. If anyone has any issues with these being applied, please
let me know.
Responses should be made by Sat, 14 Jan 2023 13:53:18 +0000.
Anything received after that time might be too late.
The whole patch series can be found in one patch at:
https://www.kernel.org/pub/linux/kernel/v6.x/stable-review/patch-6.1.6-rc1.…
or in the git tree and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-6.1.y
and the diffstat can be found below.
thanks,
greg k-h
-------------
Pseudo-Shortlog of commits:
Greg Kroah-Hartman <gregkh(a)linuxfoundation.org>
Linux 6.1.6-rc1
Frederick Lawler <fred(a)cloudflare.com>
net: sched: disallow noqueue for qdisc classes
Linus Torvalds <torvalds(a)linux-foundation.org>
gcc: disable -Warray-bounds for gcc-11 too
Chuck Lever <chuck.lever(a)oracle.com>
Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
Kyle Huey <me(a)kylehuey.com>
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Allow PKRU to be (once again) written by ptrace.
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
Kyle Huey <me(a)kylehuey.com>
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
Helge Deller <deller(a)gmx.de>
parisc: Align parisc MADV_XXX constants with all other architectures
-------------
Diffstat:
Makefile | 4 +-
arch/parisc/include/uapi/asm/mman.h | 29 +++---
arch/parisc/kernel/sys_parisc.c | 28 ++++++
arch/parisc/kernel/syscalls/syscall.tbl | 2 +-
arch/x86/kernel/fpu/core.c | 19 ++--
arch/x86/kernel/fpu/regset.c | 2 +-
arch/x86/kernel/fpu/signal.c | 2 +-
arch/x86/kernel/fpu/xstate.c | 52 ++++++++++-
arch/x86/kernel/fpu/xstate.h | 4 +-
fs/nfsd/nfs4proc.c | 7 +-
fs/nfsd/nfs4xdr.c | 2 +-
init/Kconfig | 6 +-
net/sched/sch_api.c | 5 +
net/sunrpc/auth_gss/svcauth_gss.c | 4 +-
net/sunrpc/svc.c | 6 +-
net/sunrpc/svc_xprt.c | 2 +-
net/sunrpc/svcsock.c | 8 +-
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
tools/arch/parisc/include/uapi/asm/mman.h | 12 +--
tools/perf/bench/bench.h | 12 ---
tools/testing/selftests/vm/pkey-x86.h | 12 +++
tools/testing/selftests/vm/protection_keys.c | 131 ++++++++++++++++++++++++++-
22 files changed, 276 insertions(+), 75 deletions(-)
This reverts commit 9ccd11718d76b95c69aa773f2abedef560776037.
The original commit 16fb4dca95daa ("drm/amdgpu: getting fan speed pwm for vega10 properly")
was reverted in commit 4545ae2ed3f2 ("drm/amdgpu: Revert "drm/amdgpu: getting fan speed pwm for vega10 properly"").
However, the test that resulted in the revert was wrong and was later fixed,
so the revert was itself reverted in commit 30b8e7b8ee3b ("Revert "drm/amdgpu: Revert "drm/amdgpu: getting fan speed pwm for vega10 properly""").
That should have been the end of it, but then Sasha picked up the
original revert again and it was committed as 9ccd11718d76. So drop
that commit to get back to where we need to be.
Signed-off-by: Alex Deucher <alexander.deucher(a)amd.com>
Cc: Sasha Levin <sashal(a)kernel.org>
Cc: stable(a)vger.kernel.org # 6.1.x
Cc: Yury Zhuravlev <stalkerg(a)gmail.com>
Cc: Guchun Chen <guchun.chen(a)amd.com>
Cc: Asher Song <Asher.Song(a)amd.com>
---
.../amd/pm/powerplay/hwmgr/vega10_thermal.c | 25 +++++++++----------
1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c
index dad3e3741a4e..190af79f3236 100644
--- a/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c
+++ b/drivers/gpu/drm/amd/pm/powerplay/hwmgr/vega10_thermal.c
@@ -67,22 +67,21 @@ int vega10_fan_ctrl_get_fan_speed_info(struct pp_hwmgr *hwmgr,
int vega10_fan_ctrl_get_fan_speed_pwm(struct pp_hwmgr *hwmgr,
uint32_t *speed)
{
- uint32_t current_rpm;
- uint32_t percent = 0;
-
- if (hwmgr->thermal_controller.fanInfo.bNoFan)
- return 0;
+ struct amdgpu_device *adev = hwmgr->adev;
+ uint32_t duty100, duty;
+ uint64_t tmp64;
- if (vega10_get_current_rpm(hwmgr, &current_rpm))
- return -1;
+ duty100 = REG_GET_FIELD(RREG32_SOC15(THM, 0, mmCG_FDO_CTRL1),
+ CG_FDO_CTRL1, FMAX_DUTY100);
+ duty = REG_GET_FIELD(RREG32_SOC15(THM, 0, mmCG_THERMAL_STATUS),
+ CG_THERMAL_STATUS, FDO_PWM_DUTY);
- if (hwmgr->thermal_controller.
- advanceFanControlParameters.usMaxFanRPM != 0)
- percent = current_rpm * 255 /
- hwmgr->thermal_controller.
- advanceFanControlParameters.usMaxFanRPM;
+ if (!duty100)
+ return -EINVAL;
- *speed = MIN(percent, 255);
+ tmp64 = (uint64_t)duty * 255;
+ do_div(tmp64, duty100);
+ *speed = MIN((uint32_t)tmp64, 255);
return 0;
}
--
2.39.0
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
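The resulting loop in memblock_free_late() can be sketched as follows
(matching the diff further below):

	for (; cursor < end; cursor++) {
		/* Reserved pages were already initialized by the time
		 * memblock_free_all() finished, so bypass the deferred-init
		 * check in memblock_free_pages() and release them straight
		 * to the buddy allocator. */
		__free_pages_core(pfn_to_page(cursor), 0);
		totalram_pages_inc();
	}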
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
758999246965 ("x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak")
fd8d9db3559a ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
91989c707884 ("task_work: cleanup notification modes")
216578e55ac9 ("io_uring: fix REQ_F_COMP_LOCKED by killing it")
4edf20f99902 ("io_uring: dig out COMP_LOCK from deep call chain")
b1b74cfc1967 ("io_uring: don't unnecessarily clear F_LINK_TIMEOUT")
6ad4bf6ea160 ("Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From fe1f0714385fbcf76b0cbceb02b7277d842014fc Mon Sep 17 00:00:00 2001
From: Peter Newman <peternewman(a)google.com>
Date: Tue, 20 Dec 2022 17:11:23 +0100
Subject: [PATCH] x86/resctrl: Fix task CLOSID/RMID update race
When the user moves a running task to a new rdtgroup using the task's
file interface or by deleting its rdtgroup, the resulting change in
CLOSID/RMID must be immediately propagated to the PQR_ASSOC MSR on the
task(s) CPUs.
x86 allows loads to be reordered with earlier stores, so if the task
starts running between a task_curr() check that the CPU hoisted above
the CLOSID/RMID stores and the stores themselves, it can continue
running with the old CLOSID/RMID until it is switched again, because
__rdtgroup_move_task() failed to determine that it needs to be
interrupted to obtain the new CLOSID/RMID.
Refer to the diagram below:
   CPU 0                                   CPU 1
   -----                                   -----
__rdtgroup_move_task():
  curr <- t1->cpu->rq->curr
                                           __schedule():
                                             rq->curr <- t1
                                           resctrl_sched_in():
                                             t1->{closid,rmid} -> {1,1}
t1->{closid,rmid} <- {2,2}
if (curr == t1) // false
  IPI(t1->cpu)
A similar race impacts rdt_move_group_tasks(), which updates tasks in a
deleted rdtgroup.
In both cases, use smp_mb() to order the task_struct::{closid,rmid}
stores before the loads in task_curr(). In particular, in the
rdt_move_group_tasks() case, simply execute an smp_mb() on every
iteration with a matching task.
It is possible to use a single smp_mb() in rdt_move_group_tasks(), but
this would require two passes and a means of remembering which
task_structs were updated in the first loop. However, benchmarking
results below showed too little performance impact in the simple
approach to justify implementing the two-pass approach.
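The required ordering can be sketched as follows (simplified from the
patch below; t is the task being moved):

	WRITE_ONCE(t->closid, to->closid);
	WRITE_ONCE(t->rmid, to->mon.rmid);

	/* Order the stores above before the task_curr() load below; pairs
	 * with the full barrier between the rq->curr update and
	 * resctrl_sched_in() on the context-switch side. A plain barrier()
	 * only constrains the compiler, not the CPU. */
	smp_mb();

	if (task_curr(t))
		;	/* IPI the CPU so it reloads PQR_ASSOC */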
Times below were collected using `perf stat` to measure the time to
remove a group containing a 1600-task, parallel workload.
CPU: Intel(R) Xeon(R) Platinum P-8136 CPU @ 2.00GHz (112 threads)
# mkdir /sys/fs/resctrl/test
# echo $$ > /sys/fs/resctrl/test/tasks
# perf bench sched messaging -g 40 -l 100000
task-clock time ranges collected using:
# perf stat rmdir /sys/fs/resctrl/test
Baseline: 1.54 - 1.60 ms
smp_mb() every matching task: 1.57 - 1.67 ms
[ bp: Massage commit message. ]
Fixes: ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
Fixes: 0efc89be9471 ("x86/intel_rdt: Update task closid immediately on CPU in rmdir and unmount")
Signed-off-by: Peter Newman <peternewman(a)google.com>
Signed-off-by: Borislav Petkov (AMD) <bp(a)alien8.de>
Reviewed-by: Reinette Chatre <reinette.chatre(a)intel.com>
Reviewed-by: Babu Moger <babu.moger(a)amd.com>
Cc: <stable(a)kernel.org>
Link: https://lore.kernel.org/r/20221220161123.432120-1-peternewman@google.com
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..5993da21d822 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -580,8 +580,10 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
/*
* Ensure the task's closid and rmid are written before determining if
* the task is current that will decide if it will be interrupted.
+ * This pairs with the full barrier between the rq->curr update and
+ * resctrl_sched_in() during context switch.
*/
- barrier();
+ smp_mb();
/*
* By now, the task's closid and rmid are set. If the task is current
@@ -2401,6 +2403,14 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
WRITE_ONCE(t->closid, to->closid);
WRITE_ONCE(t->rmid, to->mon.rmid);
+ /*
+ * Order the closid/rmid stores above before the loads
+ * in task_curr(). This pairs with the full barrier
+ * between the rq->curr update and resctrl_sched_in()
+ * during context switch.
+ */
+ smp_mb();
+
/*
* If the task is on a CPU, set the CPU in the mask.
* The detection is inaccurate as tasks might move or
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
758999246965 ("x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak")
fd8d9db3559a ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
91989c707884 ("task_work: cleanup notification modes")
216578e55ac9 ("io_uring: fix REQ_F_COMP_LOCKED by killing it")
4edf20f99902 ("io_uring: dig out COMP_LOCK from deep call chain")
b1b74cfc1967 ("io_uring: don't unnecessarily clear F_LINK_TIMEOUT")
6ad4bf6ea160 ("Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block")
thanks,
greg k-h
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
758999246965 ("x86/resctrl: Add necessary kernfs_put() calls to prevent refcount leak")
fd8d9db3559a ("x86/resctrl: Remove superfluous kernfs_get() calls to prevent refcount leak")
91989c707884 ("task_work: cleanup notification modes")
216578e55ac9 ("io_uring: fix REQ_F_COMP_LOCKED by killing it")
4edf20f99902 ("io_uring: dig out COMP_LOCK from deep call chain")
b1b74cfc1967 ("io_uring: don't unnecessarily clear F_LINK_TIMEOUT")
6ad4bf6ea160 ("Merge tag 'io_uring-5.10-2020-10-12' of git://git.kernel.dk/linux-block")
thanks,
greg k-h
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
fe1f0714385f ("x86/resctrl: Fix task CLOSID/RMID update race")
e0ad6dc8969f ("x86/resctrl: Use task_curr() instead of task_struct->on_cpu to prevent unnecessary IPI")
ae28d1aae48a ("x86/resctrl: Use an IPI instead of task_work_add() to update PQR_ASSOC MSR")
thanks,
greg k-h
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
76d588dddc45 ("powerpc/imc-pmu: Fix use of mutex in IRQs disabled section")
a36e8ba60b99 ("powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.")
012ae244845f ("powerpc/perf: Trace imc PMU functions")
72c69dcddce1 ("powerpc/perf: Trace imc events detection and cpuhotplug")
dd50cf7cbc7b ("powerpc/perf: Rearrange setting of ldbar for thread-imc")
4851f75098bc ("powerpc/perf: Declare static identifier a such")
5e2d059b52e3 ("Merge tag 'powerpc-4.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d588dddc459fefa1da96e0a081a397c5c8e216 Mon Sep 17 00:00:00 2001
From: Kajol Jain <kjain(a)linux.ibm.com>
Date: Fri, 6 Jan 2023 12:21:57 +0530
Subject: [PATCH] powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
Current imc-pmu code triggers a WARNING with CONFIG_DEBUG_ATOMIC_SLEEP
and CONFIG_PROVE_LOCKING enabled, while running a thread_imc event.
Command to trigger the warning:
# perf stat -e thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/ sleep 5
Performance counter stats for 'sleep 5':
0 thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/
5.002117947 seconds time elapsed
0.000131000 seconds user
0.001063000 seconds sys
Below is snippet of the warning in dmesg:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2869, name: perf-exec
preempt_count: 2, expected: 0
4 locks held by perf-exec/2869:
#0: c00000004325c540 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x64/0xa90
#1: c00000004325c5d8 (&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x460/0xef0
#2: c0000003fa99d4e0 (&cpuctx_lock){-...}-{2:2}, at: perf_event_exec+0x290/0x510
#3: c000000017ab8418 (&ctx->lock){....}-{2:2}, at: perf_event_exec+0x29c/0x510
irq event stamp: 4806
hardirqs last enabled at (4805): [<c000000000f65b94>] _raw_spin_unlock_irqrestore+0x94/0xd0
hardirqs last disabled at (4806): [<c0000000003fae44>] perf_event_exec+0x394/0x510
softirqs last enabled at (0): [<c00000000013c404>] copy_process+0xc34/0x1ff0
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 36 PID: 2869 Comm: perf-exec Not tainted 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
Call Trace:
dump_stack_lvl+0x98/0xe0 (unreliable)
__might_resched+0x2f8/0x310
__mutex_lock+0x6c/0x13f0
thread_imc_event_add+0xf4/0x1b0
event_sched_in+0xe0/0x210
merge_sched_in+0x1f0/0x600
visit_groups_merge.isra.92.constprop.166+0x2bc/0x6c0
ctx_flexible_sched_in+0xcc/0x140
ctx_sched_in+0x20c/0x2a0
ctx_resched+0x104/0x1c0
perf_event_exec+0x340/0x510
begin_new_exec+0x730/0xef0
load_elf_binary+0x3f8/0x1e10
...
do not call blocking ops when !TASK_RUNNING; state=2001 set at [<00000000fd63e7cf>] do_nanosleep+0x60/0x1a0
WARNING: CPU: 36 PID: 2869 at kernel/sched/core.c:9912 __might_sleep+0x9c/0xb0
CPU: 36 PID: 2869 Comm: sleep Tainted: G W 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
NIP: c000000000194a1c LR: c000000000194a18 CTR: c000000000a78670
REGS: c00000004d2134e0 TRAP: 0700 Tainted: G W (6.2.0-rc2-00011-g1247637727f2)
MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 48002824 XER: 00000000
CFAR: c00000000013fb64 IRQMASK: 1
The above warning is triggered because the current imc-pmu code takes a
mutex in interrupt-disabled sections. mutex_lock() internally calls
__might_resched(), which checks whether IRQs are disabled and, if so,
triggers the warning.
Fix the issue by changing the mutex lock to a spinlock.
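A minimal sketch of the conversion (the full patch updates every lock
site, as below; imc_pmu_ref is the driver's reference-count structure):

	/* Before: mutex_lock() may sleep, which is invalid while IRQs are
	 * disabled, e.g. under the perf_event_exec() path shown above. */
	static struct imc_pmu_ref ref = {
		.lock = __SPIN_LOCK_INITIALIZER(ref.lock), /* was __MUTEX_INITIALIZER */
	};

	static void ref_get(void)
	{
		/* spin_lock() busy-waits instead of sleeping, so it is safe
		 * in the IRQs-disabled sections that triggered the warning. */
		spin_lock(&ref.lock);		/* was mutex_lock(&ref.lock) */
		ref.refc++;
		spin_unlock(&ref.lock);		/* was mutex_unlock(&ref.lock) */
	}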
Fixes: 8f95faaac56c ("powerpc/powernv: Detect and create IMC device")
Reported-by: Michael Petlan <mpetlan(a)redhat.com>
Reported-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Kajol Jain <kjain(a)linux.ibm.com>
[mpe: Fix comments, trim oops in change log, add reported-by tags]
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20230106065157.182648-1-kjain@linux.ibm.com
diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 4f897993b710..699a88584ae1 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -137,7 +137,7 @@ struct imc_pmu {
* are inited.
*/
struct imc_pmu_ref {
- struct mutex lock;
+ spinlock_t lock;
unsigned int id;
int refc;
};
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index d517aba94d1b..100e97daf76b 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -14,6 +14,7 @@
#include <asm/cputhreads.h>
#include <asm/smp.h>
#include <linux/string.h>
+#include <linux/spinlock.h>
/* Nest IMC data structures and variables */
@@ -21,7 +22,7 @@
* Used to avoid races in counting the nest-pmu units during hotplug
* register and unregister
*/
-static DEFINE_MUTEX(nest_init_lock);
+static DEFINE_SPINLOCK(nest_init_lock);
static DEFINE_PER_CPU(struct imc_pmu_ref *, local_nest_imc_refc);
static struct imc_pmu **per_nest_pmu_arr;
static cpumask_t nest_imc_cpumask;
@@ -50,7 +51,7 @@ static int trace_imc_mem_size;
* core and trace-imc
*/
static struct imc_pmu_ref imc_global_refc = {
- .lock = __MUTEX_INITIALIZER(imc_global_refc.lock),
+ .lock = __SPIN_LOCK_INITIALIZER(imc_global_refc.lock),
.id = 0,
.refc = 0,
};
@@ -400,7 +401,7 @@ static int ppc_nest_imc_cpu_offline(unsigned int cpu)
get_hard_smp_processor_id(cpu));
/*
* If this is the last cpu in this chip then, skip the reference
- * count mutex lock and make the reference count on this chip zero.
+ * count lock and make the reference count on this chip zero.
*/
ref = get_nest_pmu_ref(cpu);
if (!ref)
@@ -462,15 +463,15 @@ static void nest_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the nest PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the nest counters.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return;
- /* Take the mutex lock for this node and then decrement the reference count */
- mutex_lock(&ref->lock);
+ /* Take the lock for this node and then decrement the reference count */
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -482,7 +483,7 @@ static void nest_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that node.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -490,7 +491,7 @@ static void nest_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to stop the counters for core %d\n", node_id);
return;
}
@@ -498,7 +499,7 @@ static void nest_imc_counters_release(struct perf_event *event)
WARN(1, "nest-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
}
static int nest_imc_event_init(struct perf_event *event)
@@ -557,26 +558,25 @@ static int nest_imc_event_init(struct perf_event *event)
/*
* Get the imc_pmu_ref struct for this node.
- * Take the mutex lock and then increment the count of nest pmu events
- * inited.
+ * Take the lock and then increment the count of nest pmu events inited.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to start the counters for node %d\n",
node_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
event->destroy = nest_imc_counters_release;
return 0;
@@ -612,9 +612,8 @@ static int core_imc_mem_init(int cpu, int size)
return -ENOMEM;
mem_info->vbase = page_address(page);
- /* Init the mutex */
core_imc_refc[core_id].id = core_id;
- mutex_init(&core_imc_refc[core_id].lock);
+ spin_lock_init(&core_imc_refc[core_id].lock);
rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_CORE,
__pa((void *)mem_info->vbase),
@@ -703,9 +702,8 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
perf_pmu_migrate_context(&core_imc_pmu->pmu, cpu, ncpu);
} else {
/*
- * If this is the last cpu in this core then, skip taking refernce
- * count mutex lock for this core and directly zero "refc" for
- * this core.
+ * If this is the last cpu in this core then skip taking reference
+ * count lock for this core and directly zero "refc" for this core.
*/
opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(cpu));
@@ -720,11 +718,11 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
* last cpu in this core and core-imc event running
* in this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_CORE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
return 0;
}
@@ -739,7 +737,7 @@ static int core_imc_pmu_cpumask_init(void)
static void reset_global_refc(struct perf_event *event)
{
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
imc_global_refc.refc--;
/*
@@ -751,7 +749,7 @@ static void reset_global_refc(struct perf_event *event)
imc_global_refc.refc = 0;
imc_global_refc.id = 0;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
static void core_imc_counters_release(struct perf_event *event)
@@ -764,17 +762,17 @@ static void core_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the IMC PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the core counters.
*/
core_id = event->cpu / threads_per_core;
- /* Take the mutex lock and decrement the refernce count for this core */
+ /* Take the lock and decrement the refernce count for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -786,7 +784,7 @@ static void core_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that core.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -794,7 +792,7 @@ static void core_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("IMC: Unable to stop the counters for core %d\n", core_id);
return;
}
@@ -802,7 +800,7 @@ static void core_imc_counters_release(struct perf_event *event)
WARN(1, "core-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
reset_global_refc(event);
}
@@ -840,7 +838,6 @@ static int core_imc_event_init(struct perf_event *event)
if ((!pcmi->vbase))
return -ENODEV;
- /* Get the core_imc mutex for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
@@ -848,22 +845,22 @@ static int core_imc_event_init(struct perf_event *event)
/*
* Core pmu units are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the core counters.
+ * If yes, take the lock and enable the core counters.
* If not, just increment the count in core_imc_refc struct.
*/
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("core-imc: Unable to start the counters for core %d\n",
core_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/*
* Since the system can run either in accumulation or trace-mode
@@ -874,7 +871,7 @@ static int core_imc_event_init(struct perf_event *event)
* to know whether any other trace/thread imc
* events are running.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_CORE) {
/*
* No other trace/thread imc events are running in
@@ -883,10 +880,10 @@ static int core_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_CORE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.event_base = (u64)pcmi->vbase + (config & IMC_EVENT_OFFSET_MASK);
event->destroy = core_imc_counters_release;
@@ -958,10 +955,10 @@ static int ppc_thread_imc_cpu_offline(unsigned int cpu)
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
/* Reduce the refc if thread-imc event running on this cpu */
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_THREAD)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1001,7 +998,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (!target)
return -EINVAL;
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
/*
* Check if any other trace/core imc events are running in the
* system, if not set the global id to thread-imc.
@@ -1010,10 +1007,10 @@ static int thread_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_THREAD;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->pmu->task_ctx_nr = perf_sw_context;
event->destroy = reset_global_refc;
@@ -1135,25 +1132,25 @@ static int thread_imc_event_add(struct perf_event *event, int flags)
/*
* imc pmus are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the counters.
+ * If yes, take the lock and enable the counters.
* If not, just increment the count in ref count struct.
*/
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to start the counter\
for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1170,12 +1167,12 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to stop the counters\
for core %d\n", core_id);
return;
@@ -1183,7 +1180,7 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/* Set bit 0 of LDBAR to zero, to stop posting updates to memory */
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
@@ -1224,9 +1221,8 @@ static int trace_imc_mem_alloc(int cpu_id, int size)
}
}
- /* Init the mutex, if not already */
trace_imc_refc[core_id].id = core_id;
- mutex_init(&trace_imc_refc[core_id].lock);
+ spin_lock_init(&trace_imc_refc[core_id].lock);
mtspr(SPRN_LDBAR, 0);
return 0;
@@ -1246,10 +1242,10 @@ static int ppc_trace_imc_cpu_offline(unsigned int cpu)
* Reduce the refc if any trace-imc event running
* on this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_TRACE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1371,17 +1367,17 @@ static int trace_imc_event_add(struct perf_event *event, int flags)
}
mtspr(SPRN_LDBAR, ldbar_value);
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to start the counters for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1414,19 +1410,19 @@ static void trace_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to stop the counters for core %d\n", core_id);
return;
}
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
trace_imc_event_stop(event, flags);
}
@@ -1448,7 +1444,7 @@ static int trace_imc_event_init(struct perf_event *event)
* no other thread is running any core/thread imc
* events
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_TRACE) {
/*
* No core/thread imc events are running in the
@@ -1457,10 +1453,10 @@ static int trace_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_TRACE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.idx = -1;
@@ -1533,10 +1529,10 @@ static int init_nest_pmu_ref(void)
i = 0;
for_each_node(nid) {
/*
- * Mutex lock to avoid races while tracking the number of
+ * Take the lock to avoid races while tracking the number of
* sessions using the chip's nest pmu units.
*/
- mutex_init(&nest_imc_refc[i].lock);
+ spin_lock_init(&nest_imc_refc[i].lock);
/*
* Loop to init the "id" with the node_id. Variable "i" initialized to
@@ -1633,7 +1629,7 @@ static void imc_common_mem_free(struct imc_pmu *pmu_ptr)
static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
{
if (pmu_ptr->domain == IMC_DOMAIN_NEST) {
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 1) {
cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE);
kfree(nest_imc_refc);
@@ -1643,7 +1639,7 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
if (nest_pmus > 0)
nest_pmus--;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
}
/* Free core_imc memory */
@@ -1800,11 +1796,11 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
* rest. To handle the cpuhotplug callback unregister, we track
* the number of nest pmus in "nest_pmus".
*/
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 0) {
ret = init_nest_pmu_ref();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
goto err_free_mem;
@@ -1812,7 +1808,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
/* Register for cpu hotplug notification. */
ret = nest_pmu_cpumask_init();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(nest_imc_refc);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
@@ -1820,7 +1816,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
}
}
nest_pmus++;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
break;
case IMC_DOMAIN_CORE:
ret = core_imc_pmu_cpumask_init();
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
76d588dddc45 ("powerpc/imc-pmu: Fix use of mutex in IRQs disabled section")
a36e8ba60b99 ("powerpc/perf: Implement a global lock to avoid races between trace, core and thread imc events.")
012ae244845f ("powerpc/perf: Trace imc PMU functions")
72c69dcddce1 ("powerpc/perf: Trace imc events detection and cpuhotplug")
dd50cf7cbc7b ("powerpc/perf: Rearrange setting of ldbar for thread-imc")
4851f75098bc ("powerpc/perf: Declare static identifier a such")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 76d588dddc459fefa1da96e0a081a397c5c8e216 Mon Sep 17 00:00:00 2001
From: Kajol Jain <kjain(a)linux.ibm.com>
Date: Fri, 6 Jan 2023 12:21:57 +0530
Subject: [PATCH] powerpc/imc-pmu: Fix use of mutex in IRQs disabled section
The current imc-pmu code triggers a WARNING with CONFIG_DEBUG_ATOMIC_SLEEP
and CONFIG_PROVE_LOCKING enabled while running a thread_imc event.
Command to trigger the warning:
# perf stat -e thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/ sleep 5
Performance counter stats for 'sleep 5':
0 thread_imc/CPM_CS_FROM_L4_MEM_X_DPTEG/
5.002117947 seconds time elapsed
0.000131000 seconds user
0.001063000 seconds sys
Below is snippet of the warning in dmesg:
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2869, name: perf-exec
preempt_count: 2, expected: 0
4 locks held by perf-exec/2869:
#0: c00000004325c540 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0x64/0xa90
#1: c00000004325c5d8 (&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exec+0x460/0xef0
#2: c0000003fa99d4e0 (&cpuctx_lock){-...}-{2:2}, at: perf_event_exec+0x290/0x510
#3: c000000017ab8418 (&ctx->lock){....}-{2:2}, at: perf_event_exec+0x29c/0x510
irq event stamp: 4806
hardirqs last enabled at (4805): [<c000000000f65b94>] _raw_spin_unlock_irqrestore+0x94/0xd0
hardirqs last disabled at (4806): [<c0000000003fae44>] perf_event_exec+0x394/0x510
softirqs last enabled at (0): [<c00000000013c404>] copy_process+0xc34/0x1ff0
softirqs last disabled at (0): [<0000000000000000>] 0x0
CPU: 36 PID: 2869 Comm: perf-exec Not tainted 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
Call Trace:
dump_stack_lvl+0x98/0xe0 (unreliable)
__might_resched+0x2f8/0x310
__mutex_lock+0x6c/0x13f0
thread_imc_event_add+0xf4/0x1b0
event_sched_in+0xe0/0x210
merge_sched_in+0x1f0/0x600
visit_groups_merge.isra.92.constprop.166+0x2bc/0x6c0
ctx_flexible_sched_in+0xcc/0x140
ctx_sched_in+0x20c/0x2a0
ctx_resched+0x104/0x1c0
perf_event_exec+0x340/0x510
begin_new_exec+0x730/0xef0
load_elf_binary+0x3f8/0x1e10
...
do not call blocking ops when !TASK_RUNNING; state=2001 set at [<00000000fd63e7cf>] do_nanosleep+0x60/0x1a0
WARNING: CPU: 36 PID: 2869 at kernel/sched/core.c:9912 __might_sleep+0x9c/0xb0
CPU: 36 PID: 2869 Comm: sleep Tainted: G W 6.2.0-rc2-00011-g1247637727f2 #61
Hardware name: 8375-42A POWER9 0x4e1202 opal:v7.0-16-g9b85f7d961 PowerNV
NIP: c000000000194a1c LR: c000000000194a18 CTR: c000000000a78670
REGS: c00000004d2134e0 TRAP: 0700 Tainted: G W (6.2.0-rc2-00011-g1247637727f2)
MSR: 9000000000021033 <SF,HV,ME,IR,DR,RI,LE> CR: 48002824 XER: 00000000
CFAR: c00000000013fb64 IRQMASK: 1
The above warning triggered because the current imc-pmu code takes a
mutex in interrupt-disabled sections. mutex_lock() internally calls
__might_resched(), which checks whether IRQs are disabled and, if they
are, triggers the warning.
Fix the issue by changing the mutex lock to a spinlock.
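As a quick illustration (hypothetical names, not the imc-pmu code
itself): a mutex may sleep, so taking one in an IRQs-disabled section is
invalid, while a spinlock busy-waits and is safe there.
/* Hedged sketch: a perf pmu ->add() callback runs in atomic context
 * (IRQs disabled), so it must not take a sleeping lock. */
#include <linux/spinlock.h>
struct demo_ref {
	spinlock_t lock;	/* was: struct mutex lock; */
	int refc;
};
static struct demo_ref demo = {
	.lock = __SPIN_LOCK_INITIALIZER(demo.lock),
};
static void demo_event_add(void)	/* called with IRQs disabled */
{
	/* mutex_lock() here would call __might_resched() and trigger the
	 * warning above; spin_lock() never sleeps, so it is legal in this
	 * context at the cost of busy-waiting. */
	spin_lock(&demo.lock);
	demo.refc++;
	spin_unlock(&demo.lock);
}
The trade-off is that spinlock critical sections must stay short and
must not sleep, which the reference-count updates here satisfy.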
Fixes: 8f95faaac56c ("powerpc/powernv: Detect and create IMC device")
Reported-by: Michael Petlan <mpetlan(a)redhat.com>
Reported-by: Peter Zijlstra <peterz(a)infradead.org>
Signed-off-by: Kajol Jain <kjain(a)linux.ibm.com>
[mpe: Fix comments, trim oops in change log, add reported-by tags]
Signed-off-by: Michael Ellerman <mpe(a)ellerman.id.au>
Link: https://lore.kernel.org/r/20230106065157.182648-1-kjain@linux.ibm.com
diff --git a/arch/powerpc/include/asm/imc-pmu.h b/arch/powerpc/include/asm/imc-pmu.h
index 4f897993b710..699a88584ae1 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -137,7 +137,7 @@ struct imc_pmu {
* are inited.
*/
struct imc_pmu_ref {
- struct mutex lock;
+ spinlock_t lock;
unsigned int id;
int refc;
};
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index d517aba94d1b..100e97daf76b 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -14,6 +14,7 @@
#include <asm/cputhreads.h>
#include <asm/smp.h>
#include <linux/string.h>
+#include <linux/spinlock.h>
/* Nest IMC data structures and variables */
@@ -21,7 +22,7 @@
* Used to avoid races in counting the nest-pmu units during hotplug
* register and unregister
*/
-static DEFINE_MUTEX(nest_init_lock);
+static DEFINE_SPINLOCK(nest_init_lock);
static DEFINE_PER_CPU(struct imc_pmu_ref *, local_nest_imc_refc);
static struct imc_pmu **per_nest_pmu_arr;
static cpumask_t nest_imc_cpumask;
@@ -50,7 +51,7 @@ static int trace_imc_mem_size;
* core and trace-imc
*/
static struct imc_pmu_ref imc_global_refc = {
- .lock = __MUTEX_INITIALIZER(imc_global_refc.lock),
+ .lock = __SPIN_LOCK_INITIALIZER(imc_global_refc.lock),
.id = 0,
.refc = 0,
};
@@ -400,7 +401,7 @@ static int ppc_nest_imc_cpu_offline(unsigned int cpu)
get_hard_smp_processor_id(cpu));
/*
* If this is the last cpu in this chip then, skip the reference
- * count mutex lock and make the reference count on this chip zero.
+ * count lock and make the reference count on this chip zero.
*/
ref = get_nest_pmu_ref(cpu);
if (!ref)
@@ -462,15 +463,15 @@ static void nest_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the nest PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the nest counters.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return;
- /* Take the mutex lock for this node and then decrement the reference count */
- mutex_lock(&ref->lock);
+ /* Take the lock for this node and then decrement the reference count */
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -482,7 +483,7 @@ static void nest_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that node.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -490,7 +491,7 @@ static void nest_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to stop the counters for core %d\n", node_id);
return;
}
@@ -498,7 +499,7 @@ static void nest_imc_counters_release(struct perf_event *event)
WARN(1, "nest-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
}
static int nest_imc_event_init(struct perf_event *event)
@@ -557,26 +558,25 @@ static int nest_imc_event_init(struct perf_event *event)
/*
* Get the imc_pmu_ref struct for this node.
- * Take the mutex lock and then increment the count of nest pmu events
- * inited.
+ * Take the lock and then increment the count of nest pmu events inited.
*/
ref = get_nest_pmu_ref(event->cpu);
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_NEST,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("nest-imc: Unable to start the counters for node %d\n",
node_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
event->destroy = nest_imc_counters_release;
return 0;
@@ -612,9 +612,8 @@ static int core_imc_mem_init(int cpu, int size)
return -ENOMEM;
mem_info->vbase = page_address(page);
- /* Init the mutex */
core_imc_refc[core_id].id = core_id;
- mutex_init(&core_imc_refc[core_id].lock);
+ spin_lock_init(&core_imc_refc[core_id].lock);
rc = opal_imc_counters_init(OPAL_IMC_COUNTERS_CORE,
__pa((void *)mem_info->vbase),
@@ -703,9 +702,8 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
perf_pmu_migrate_context(&core_imc_pmu->pmu, cpu, ncpu);
} else {
/*
- * If this is the last cpu in this core then, skip taking refernce
- * count mutex lock for this core and directly zero "refc" for
- * this core.
+ * If this is the last cpu in this core then skip taking reference
+ * count lock for this core and directly zero "refc" for this core.
*/
opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(cpu));
@@ -720,11 +718,11 @@ static int ppc_core_imc_cpu_offline(unsigned int cpu)
* last cpu in this core and core-imc event running
* in this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_CORE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
return 0;
}
@@ -739,7 +737,7 @@ static int core_imc_pmu_cpumask_init(void)
static void reset_global_refc(struct perf_event *event)
{
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
imc_global_refc.refc--;
/*
@@ -751,7 +749,7 @@ static void reset_global_refc(struct perf_event *event)
imc_global_refc.refc = 0;
imc_global_refc.id = 0;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
}
static void core_imc_counters_release(struct perf_event *event)
@@ -764,17 +762,17 @@ static void core_imc_counters_release(struct perf_event *event)
/*
* See if we need to disable the IMC PMU.
* If no events are currently in use, then we have to take a
- * mutex to ensure that we don't race with another task doing
+ * lock to ensure that we don't race with another task doing
* enable or disable the core counters.
*/
core_id = event->cpu / threads_per_core;
- /* Take the mutex lock and decrement the refernce count for this core */
+ /* Take the lock and decrement the refernce count for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
/*
* The scenario where this is true is, when perf session is
@@ -786,7 +784,7 @@ static void core_imc_counters_release(struct perf_event *event)
* an OPAL call to disable the engine in that core.
*
*/
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return;
}
ref->refc--;
@@ -794,7 +792,7 @@ static void core_imc_counters_release(struct perf_event *event)
rc = opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("IMC: Unable to stop the counters for core %d\n", core_id);
return;
}
@@ -802,7 +800,7 @@ static void core_imc_counters_release(struct perf_event *event)
WARN(1, "core-imc: Invalid event reference count\n");
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
reset_global_refc(event);
}
@@ -840,7 +838,6 @@ static int core_imc_event_init(struct perf_event *event)
if ((!pcmi->vbase))
return -ENODEV;
- /* Get the core_imc mutex for this core */
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
@@ -848,22 +845,22 @@ static int core_imc_event_init(struct perf_event *event)
/*
* Core pmu units are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the core counters.
+ * If yes, take the lock and enable the core counters.
* If not, just increment the count in core_imc_refc struct.
*/
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
rc = opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(event->cpu));
if (rc) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("core-imc: Unable to start the counters for core %d\n",
core_id);
return rc;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/*
* Since the system can run either in accumulation or trace-mode
@@ -874,7 +871,7 @@ static int core_imc_event_init(struct perf_event *event)
* to know whether any other trace/thread imc
* events are running.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_CORE) {
/*
* No other trace/thread imc events are running in
@@ -883,10 +880,10 @@ static int core_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_CORE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.event_base = (u64)pcmi->vbase + (config & IMC_EVENT_OFFSET_MASK);
event->destroy = core_imc_counters_release;
@@ -958,10 +955,10 @@ static int ppc_thread_imc_cpu_offline(unsigned int cpu)
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
/* Reduce the refc if thread-imc event running on this cpu */
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_THREAD)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1001,7 +998,7 @@ static int thread_imc_event_init(struct perf_event *event)
if (!target)
return -EINVAL;
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
/*
* Check if any other trace/core imc events are running in the
* system, if not set the global id to thread-imc.
@@ -1010,10 +1007,10 @@ static int thread_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_THREAD;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->pmu->task_ctx_nr = perf_sw_context;
event->destroy = reset_global_refc;
@@ -1135,25 +1132,25 @@ static int thread_imc_event_add(struct perf_event *event, int flags)
/*
* imc pmus are enabled only when it is used.
* See if this is triggered for the first time.
- * If yes, take the mutex lock and enable the counters.
+ * If yes, take the lock and enable the counters.
* If not, just increment the count in ref count struct.
*/
ref = &core_imc_refc[core_id];
if (!ref)
return -EINVAL;
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to start the counter\
for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1170,12 +1167,12 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_CORE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("thread-imc: Unable to stop the counters\
for core %d\n", core_id);
return;
@@ -1183,7 +1180,7 @@ static void thread_imc_event_del(struct perf_event *event, int flags)
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
/* Set bit 0 of LDBAR to zero, to stop posting updates to memory */
mtspr(SPRN_LDBAR, (mfspr(SPRN_LDBAR) & (~(1UL << 63))));
@@ -1224,9 +1221,8 @@ static int trace_imc_mem_alloc(int cpu_id, int size)
}
}
- /* Init the mutex, if not already */
trace_imc_refc[core_id].id = core_id;
- mutex_init(&trace_imc_refc[core_id].lock);
+ spin_lock_init(&trace_imc_refc[core_id].lock);
mtspr(SPRN_LDBAR, 0);
return 0;
@@ -1246,10 +1242,10 @@ static int ppc_trace_imc_cpu_offline(unsigned int cpu)
* Reduce the refc if any trace-imc event running
* on this cpu.
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == IMC_DOMAIN_TRACE)
imc_global_refc.refc--;
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return 0;
}
@@ -1371,17 +1367,17 @@ static int trace_imc_event_add(struct perf_event *event, int flags)
}
mtspr(SPRN_LDBAR, ldbar_value);
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
if (ref->refc == 0) {
if (opal_imc_counters_start(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to start the counters for core %d\n", core_id);
return -EINVAL;
}
}
++ref->refc;
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
return 0;
}
@@ -1414,19 +1410,19 @@ static void trace_imc_event_del(struct perf_event *event, int flags)
return;
}
- mutex_lock(&ref->lock);
+ spin_lock(&ref->lock);
ref->refc--;
if (ref->refc == 0) {
if (opal_imc_counters_stop(OPAL_IMC_COUNTERS_TRACE,
get_hard_smp_processor_id(smp_processor_id()))) {
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
pr_err("trace-imc: Unable to stop the counters for core %d\n", core_id);
return;
}
} else if (ref->refc < 0) {
ref->refc = 0;
}
- mutex_unlock(&ref->lock);
+ spin_unlock(&ref->lock);
trace_imc_event_stop(event, flags);
}
@@ -1448,7 +1444,7 @@ static int trace_imc_event_init(struct perf_event *event)
* no other thread is running any core/thread imc
* events
*/
- mutex_lock(&imc_global_refc.lock);
+ spin_lock(&imc_global_refc.lock);
if (imc_global_refc.id == 0 || imc_global_refc.id == IMC_DOMAIN_TRACE) {
/*
* No core/thread imc events are running in the
@@ -1457,10 +1453,10 @@ static int trace_imc_event_init(struct perf_event *event)
imc_global_refc.id = IMC_DOMAIN_TRACE;
imc_global_refc.refc++;
} else {
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
return -EBUSY;
}
- mutex_unlock(&imc_global_refc.lock);
+ spin_unlock(&imc_global_refc.lock);
event->hw.idx = -1;
@@ -1533,10 +1529,10 @@ static int init_nest_pmu_ref(void)
i = 0;
for_each_node(nid) {
/*
- * Mutex lock to avoid races while tracking the number of
+ * Take the lock to avoid races while tracking the number of
* sessions using the chip's nest pmu units.
*/
- mutex_init(&nest_imc_refc[i].lock);
+ spin_lock_init(&nest_imc_refc[i].lock);
/*
* Loop to init the "id" with the node_id. Variable "i" initialized to
@@ -1633,7 +1629,7 @@ static void imc_common_mem_free(struct imc_pmu *pmu_ptr)
static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
{
if (pmu_ptr->domain == IMC_DOMAIN_NEST) {
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 1) {
cpuhp_remove_state(CPUHP_AP_PERF_POWERPC_NEST_IMC_ONLINE);
kfree(nest_imc_refc);
@@ -1643,7 +1639,7 @@ static void imc_common_cpuhp_mem_free(struct imc_pmu *pmu_ptr)
if (nest_pmus > 0)
nest_pmus--;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
}
/* Free core_imc memory */
@@ -1800,11 +1796,11 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
* rest. To handle the cpuhotplug callback unregister, we track
* the number of nest pmus in "nest_pmus".
*/
- mutex_lock(&nest_init_lock);
+ spin_lock(&nest_init_lock);
if (nest_pmus == 0) {
ret = init_nest_pmu_ref();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
goto err_free_mem;
@@ -1812,7 +1808,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
/* Register for cpu hotplug notification. */
ret = nest_pmu_cpumask_init();
if (ret) {
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
kfree(nest_imc_refc);
kfree(per_nest_pmu_arr);
per_nest_pmu_arr = NULL;
@@ -1820,7 +1816,7 @@ int init_imc_pmu(struct device_node *parent, struct imc_pmu *pmu_ptr, int pmu_id
}
}
nest_pmus++;
- mutex_unlock(&nest_init_lock);
+ spin_unlock(&nest_init_lock);
break;
case IMC_DOMAIN_CORE:
ret = core_imc_pmu_cpumask_init();
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
87ca4f9efbd7 ("sched/core: Fix use-after-free bug in dup_user_cpus_ptr()")
8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
713a2e21a513 ("sched: Introduce affinity_context")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 87ca4f9efbd7cc649ff43b87970888f2812945b8 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman(a)redhat.com>
Date: Fri, 30 Dec 2022 23:11:19 -0500
Subject: [PATCH] sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Since commit 07ec77a1d4e8 ("sched: Allow task CPU affinity to be
restricted on asymmetric systems"), the setting and clearing of
user_cpus_ptr are done under pi_lock for the arm64 architecture. However,
dup_user_cpus_ptr() accesses user_cpus_ptr without any lock
protection. Since sched_setaffinity() can be invoked from another
process, the process being modified may be undergoing fork() at
the same time. When racing with the clearing of user_cpus_ptr in
__set_cpus_allowed_ptr_locked(), it can lead to a use-after-free and
possibly a double-free in the arm64 kernel.
Commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask") fixes this problem as user_cpus_ptr, once set, will never
be cleared in a task's lifetime. However, this bug was re-introduced
in commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in
do_set_cpus_allowed(). This time, it will affect all arches.
Fix this bug by always clearing the user_cpus_ptr of the newly
cloned/forked task before the copying process starts and checking the
user_cpus_ptr state of the source task under pi_lock.
Note to stable, this patch won't be applicable to stable releases.
Just copy the new dup_user_cpus_ptr() function over.
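In outline, the new function follows the common "racy fast-path check,
then re-check under the lock" pattern; this annotated condensation is a
sketch, not a substitute for the real hunk that follows:
/* 1) Never inherit dst state.
 * 2) Lock-free check; losing the race here is harmless. */
dst->user_cpus_ptr = NULL;
if (data_race(!src->user_cpus_ptr))
	return 0;
/* 3) Allocate outside the lock. */
user_mask = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
if (!user_mask)
	return -ENOMEM;
/* 4) Re-check and copy under pi_lock, the lock held by
 * do_set_cpus_allowed() when it clears user_cpus_ptr. */
raw_spin_lock_irqsave(&src->pi_lock, flags);
if (src->user_cpus_ptr) {
	swap(dst->user_cpus_ptr, user_mask);
	cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
}
raw_spin_unlock_irqrestore(&src->pi_lock, flags);
/* 5) user_mask is NULL if the swap happened, else the unused buffer. */
kfree(user_mask);
return 0;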
Fixes: 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems")
Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: David Wang 王标 <wangbiao3(a)xiaomi.com>
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Peter Zijlstra <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 965d813c28ad..f9f6e5413dcf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2612,19 +2612,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
int node)
{
+ cpumask_t *user_mask;
unsigned long flags;
- if (!src->user_cpus_ptr)
+ /*
+ * Always clear dst->user_cpus_ptr first as their user_cpus_ptr's
+ * may differ by now due to racing.
+ */
+ dst->user_cpus_ptr = NULL;
+
+ /*
+ * This check is racy and losing the race is a valid situation.
+ * It is not worth the extra overhead of taking the pi_lock on
+ * every fork/clone.
+ */
+ if (data_race(!src->user_cpus_ptr))
return 0;
- dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
- if (!dst->user_cpus_ptr)
+ user_mask = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
+ if (!user_mask)
return -ENOMEM;
- /* Use pi_lock to protect content of user_cpus_ptr */
+ /*
+ * Use pi_lock to protect content of user_cpus_ptr
+ *
+ * Though unlikely, user_cpus_ptr can be reset to NULL by a concurrent
+ * do_set_cpus_allowed().
+ */
raw_spin_lock_irqsave(&src->pi_lock, flags);
- cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ if (src->user_cpus_ptr) {
+ swap(dst->user_cpus_ptr, user_mask);
+ cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ }
raw_spin_unlock_irqrestore(&src->pi_lock, flags);
+
+ if (unlikely(user_mask))
+ kfree(user_mask);
+
return 0;
}
Syzkaller reports a warning in submit_bio_checks() in 5.10 stable
releases. The issue has been fixed by the following patch, which applies
cleanly to the 5.10 branch.
bio->bi_bdev from the original patch is not compatible with 5.10, so the
print message was adapted to work cleanly with 5.10.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
142e821f68cf ("iommu/mediatek-v1: Fix an error handling path in mtk_iommu_v1_probe()")
ac304c070c54 ("iommu/mediatek-v1: Add error handle for mtk_iommu_probe")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 142e821f68cf5da79ce722cb9c1323afae30e185 Mon Sep 17 00:00:00 2001
From: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Date: Mon, 19 Dec 2022 19:06:22 +0100
Subject: [PATCH] iommu/mediatek-v1: Fix an error handling path in
mtk_iommu_v1_probe()
A clk, prepared and enabled in mtk_iommu_v1_hw_init(), is not released in
the error handling path of mtk_iommu_v1_probe().
Add the corresponding clk_disable_unprepare(), as already done in the
remove function.
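For context, a hedged sketch of the goto-unwind shape the fix completes
(condensed; clk_prepare_enable() actually happens in
mtk_iommu_v1_hw_init(), called earlier in probe):
/* Error paths reached after the clock is enabled must undo it. */
ret = iommu_device_sysfs_add(&data->iommu, &pdev->dev, NULL,
			     dev_name(&pdev->dev));
if (ret)
	goto out_clk_unprepare;	/* previously: return ret, leaking the clk */
...
out_clk_unprepare:
	clk_disable_unprepare(data->bclk);
	return ret;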
Fixes: b17336c55d89 ("iommu/mediatek: add support for mtk iommu generation one HW")
Signed-off-by: Christophe JAILLET <christophe.jaillet(a)wanadoo.fr>
Reviewed-by: Yong Wu <yong.wu(a)mediatek.com>
Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno(a)collabora.com>
Reviewed-by: Matthias Brugger <matthias.bgg(a)gmail.com>
Link: https://lore.kernel.org/r/593e7b7d97c6e064b29716b091a9d4fd122241fb.16714731…
Signed-off-by: Joerg Roedel <jroedel(a)suse.de>
diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c
index 69682ee068d2..ca581ff1c769 100644
--- a/drivers/iommu/mtk_iommu_v1.c
+++ b/drivers/iommu/mtk_iommu_v1.c
@@ -683,7 +683,7 @@ static int mtk_iommu_v1_probe(struct platform_device *pdev)
ret = iommu_device_sysfs_add(&data->iommu, &pdev->dev, NULL,
dev_name(&pdev->dev));
if (ret)
- return ret;
+ goto out_clk_unprepare;
ret = iommu_device_register(&data->iommu, &mtk_iommu_v1_ops, dev);
if (ret)
@@ -698,6 +698,8 @@ out_dev_unreg:
iommu_device_unregister(&data->iommu);
out_sysfs_remove:
iommu_device_sysfs_remove(&data->iommu);
+out_clk_unprepare:
+ clk_disable_unprepare(data->bclk);
return ret;
}
Hello,
I'm running Ubuntu 20.10 and hit a regression with kernel 6.1.5 that was
not present in previous kernels such as 6.0.19 or 6.0.9.
I have two monitors attached using DisplayPort MST (daisy-chaining the
displays).
Normally this works without issue: both monitors are detected and
display properly.
With kernel 6.1.5, the second monitor (the one not directly connected to
the computer) was detected but would not display anything; the first
monitor was unaffected. I didn't spend much time troubleshooting and
instead reverted to 6.0.19, where both monitors again worked properly.
Let me know what steps I can take to help identify the root cause.
-Matt
Syzkaller reports suspicious RCU usage in xfrm_set_default() in 5.10
stable releases. The problem has been fixed by the following patch,
which applies cleanly to the 5.10 branch.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
3b50142d8528 ("MAINTAINERS: sort field names for all entries")
4400b7d68f6e ("MAINTAINERS: sort entries by entry name")
b032227c6293 ("Merge tag 'nios2-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
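A hedged, simplified sketch of the pre-patch flow (condensed from
mm/page_alloc.c at the time; helper details omitted):
/* memblock_free_pages() skips pages in the deferred-init PFN range,
 * assuming deferred init will release them later. That never happens
 * for reserved ranges freed via memblock_free_late(). */
void __init memblock_free_pages(struct page *page, unsigned long pfn,
				unsigned int order)
{
	if (early_page_uninitialised(pfn))
		return;		/* deferred init owns this range */
	if (!kmsan_memblock_free_pages(page, order))
		return;		/* KMSAN will take care of these pages */
	__free_pages_core(page, order);
}
/* Hence the fix: memblock_free_late() calls __free_pages_core()
 * directly, since its pages are already initialized by then. */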
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
3b50142d8528 ("MAINTAINERS: sort field names for all entries")
4400b7d68f6e ("MAINTAINERS: sort entries by entry name")
b032227c6293 ("Merge tag 'nios2-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
3b50142d8528 ("MAINTAINERS: sort field names for all entries")
4400b7d68f6e ("MAINTAINERS: sort entries by entry name")
b032227c6293 ("Merge tag 'nios2-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/lftan/nios2")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
115d9d77bb0f ("mm: Always release pages to the buddy allocator in memblock_free_late().")
16802e55dea9 ("memblock tests: Add skeleton of the memblock simulator")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 115d9d77bb0f9152c60b6e8646369fa7f6167593 Mon Sep 17 00:00:00 2001
From: Aaron Thompson <dev(a)aaront.org>
Date: Fri, 6 Jan 2023 22:22:44 +0000
Subject: [PATCH] mm: Always release pages to the buddy allocator in
memblock_free_late().
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, memblock_free_pages()
only releases pages to the buddy allocator if they are not in the
deferred range. This is correct for free pages (as defined by
for_each_free_mem_pfn_range_in_zone()) because free pages in the
deferred range will be initialized and released as part of the deferred
init process. memblock_free_pages() is called by memblock_free_late(),
which is used to free reserved ranges after memblock_free_all() has
run. All pages in reserved ranges have been initialized at that point,
and accordingly, those pages are not touched by the deferred init
process. This means that currently, if the pages that
memblock_free_late() intends to release are in the deferred range, they
will never be released to the buddy allocator. They will forever be
reserved.
In addition, memblock_free_pages() calls kmsan_memblock_free_pages(),
which is also correct for free pages but is not correct for reserved
pages. KMSAN metadata for reserved pages is initialized by
kmsan_init_shadow(), which runs shortly before memblock_free_all().
For both of these reasons, memblock_free_pages() should only be called
for free pages, and memblock_free_late() should call __free_pages_core()
directly instead.
One case where this issue can occur in the wild is EFI boot on
x86_64. The x86 EFI code reserves all EFI boot services memory ranges
via memblock_reserve() and frees them later via memblock_free_late()
(efi_reserve_boot_services() and efi_free_boot_services(),
respectively). If any of those ranges happens to fall within the
deferred init range, the pages will not be released and that memory will
be unavailable.
For example, on an Amazon EC2 t3.micro VM (1 GB) booting via EFI:
v6.2-rc2:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 178867
v6.2-rc2 + patch:
# grep -E 'Node|spanned|present|managed' /proc/zoneinfo
Node 0, zone DMA
spanned 4095
present 3999
managed 3840
Node 0, zone DMA32
spanned 246652
present 245868
managed 222816 # +43,949 pages
Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
Signed-off-by: Aaron Thompson <dev(a)aaront.org>
Link: https://lore.kernel.org/r/01010185892de53e-e379acfb-7044-4b24-b30a-e2657c1b…
Signed-off-by: Mike Rapoport (IBM) <rppt(a)kernel.org>
diff --git a/mm/memblock.c b/mm/memblock.c
index d036c7861310..685e30e6d27c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1640,7 +1640,13 @@ void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
end = PFN_DOWN(base + size);
for (; cursor < end; cursor++) {
- memblock_free_pages(pfn_to_page(cursor), cursor, 0);
+ /*
+ * Reserved pages are always initialized by the end of
+ * memblock_free_all() (by memmap_init() and, if deferred
+ * initialization is enabled, memmap_init_reserved_pages()), so
+ * these pages can be released directly to the buddy allocator.
+ */
+ __free_pages_core(pfn_to_page(cursor), 0);
totalram_pages_inc();
}
}
diff --git a/tools/testing/memblock/internal.h b/tools/testing/memblock/internal.h
index fdb7f5db7308..85973e55489e 100644
--- a/tools/testing/memblock/internal.h
+++ b/tools/testing/memblock/internal.h
@@ -15,6 +15,10 @@ bool mirrored_kernelcore = false;
struct page {};
+void __free_pages_core(struct page *page, unsigned int order)
+{
+}
+
void memblock_free_pages(struct page *page, unsigned long pfn,
unsigned int order)
{
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
544d163d659d ("io_uring: lock overflowing for IOPOLL")
a8cf95f93610 ("io_uring: fix overflow handling regression")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 544d163d659d45a206d8929370d5a2984e546cb7 Mon Sep 17 00:00:00 2001
From: Pavel Begunkov <asml.silence(a)gmail.com>
Date: Thu, 12 Jan 2023 13:08:56 +0000
Subject: [PATCH] io_uring: lock overflowing for IOPOLL
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
syzbot reports an issue with overflow filling for IOPOLL:
WARNING: CPU: 0 PID: 28 at io_uring/io_uring.c:734 io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
CPU: 0 PID: 28 Comm: kworker/u4:1 Not tainted 6.2.0-rc3-syzkaller-16369-g358a161a6a9e #0
Workqueue: events_unbound io_ring_exit_work
Call trace:
io_cqring_event_overflow+0x1c0/0x230 io_uring/io_uring.c:734
io_req_cqe_overflow+0x5c/0x70 io_uring/io_uring.c:773
io_fill_cqe_req io_uring/io_uring.h:168 [inline]
io_do_iopoll+0x474/0x62c io_uring/rw.c:1065
io_iopoll_try_reap_events+0x6c/0x108 io_uring/io_uring.c:1513
io_uring_try_cancel_requests+0x13c/0x258 io_uring/io_uring.c:3056
io_ring_exit_work+0xec/0x390 io_uring/io_uring.c:2869
process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
worker_thread+0x340/0x610 kernel/workqueue.c:2436
kthread+0x12c/0x158 kernel/kthread.c:376
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:863
There is no real problem for normal IOPOLL as flush is also called with
uring_lock taken, but it's getting more complicated for IOPOLL|SQPOLL,
for which __io_cqring_overflow_flush() happens from the CQ waiting path.
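In short, the overflow path requires ctx->completion_lock, which the
IOPOLL reaping loop does not hold; the fix, annotated here as a sketch
of the hunk that follows, takes the lock only around the slow path:
/* Filling the CQE fails only when the CQ ring is full; the overflow
 * list that io_req_cqe_overflow() appends to is protected by
 * completion_lock, so take it just for that case. */
if (unlikely(!__io_fill_cqe_req(ctx, req))) {
	spin_lock(&ctx->completion_lock);
	io_req_cqe_overflow(req);
	spin_unlock(&ctx->completion_lock);
}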
Reported-and-tested-by: syzbot+6805087452d72929404e(a)syzkaller.appspotmail.com
Cc: stable(a)vger.kernel.org # 5.10+
Signed-off-by: Pavel Begunkov <asml.silence(a)gmail.com>
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 8227af2e1c0f..9c3ddd46a1ad 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1062,7 +1062,11 @@ int io_do_iopoll(struct io_ring_ctx *ctx, bool force_nonspin)
continue;
req->cqe.flags = io_put_kbuf(req, 0);
- io_fill_cqe_req(req->ctx, req);
+ if (unlikely(!__io_fill_cqe_req(ctx, req))) {
+ spin_lock(&ctx->completion_lock);
+ io_req_cqe_overflow(req);
+ spin_unlock(&ctx->completion_lock);
+ }
}
if (unlikely(!nr_events))
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
6e5aedb9324a ("io_uring/poll: attempt request issue after racy poll wakeup")
443e57550670 ("io_uring: combine poll tw handlers")
047b6aef0966 ("io_uring: remove ctx variable in io_poll_check_events")
9b8c54755a2b ("io_uring: add io_aux_cqe which allows deferred completion")
973fc83f3a94 ("io_uring: defer all io_req_complete_failed")
c06c6c5d2767 ("io_uring: always lock in io_apoll_task_func")
1bec951c3809 ("io_uring: iopoll protect complete_post")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
e276ae344a77 ("io_uring: hold locks for io_req_complete_failed")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
cd42a53d25d4 ("io_uring/poll: remove outdated comments of caching")
e2ad599d1ed3 ("io_uring: allow multishot recv CQEs to overflow")
515e26961295 ("io_uring: revert "io_uring fix multishot accept ordering"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 6e5aedb9324aab1c14a23fae3d8eeb64a679c20e Mon Sep 17 00:00:00 2001
From: Jens Axboe <axboe(a)kernel.dk>
Date: Tue, 10 Jan 2023 10:44:37 -0700
Subject: [PATCH] io_uring/poll: attempt request issue after racy poll wakeup
If we have multiple requests waiting on the same target poll waitqueue,
then it's quite possible to get a request triggered and get disappointed
in not being able to make any progress with it. If we race in doing so,
we'll potentially leave the poll request on the internal tables, but
removed from the waitqueue. That means that any subsequent trigger of
the poll waitqueue will not kick that request into action, causing an
application to potentially wait for completion of a request that will
never happen.
Fix this by adding a new poll return state, IOU_POLL_REISSUE. Rather
than have complicated logic for how to re-arm a given type of request,
just punt it for a reissue.
While in there, move the 'ret' variable to the only section where it
gets used. This avoids confusion about its scope.
Cc: stable(a)vger.kernel.org
Fixes: eb0089d629ba ("io_uring: single shot poll removal optimisation")
Signed-off-by: Jens Axboe <axboe(a)kernel.dk>
diff --git a/io_uring/poll.c b/io_uring/poll.c
index cf6a70bd54e0..32e5fc8365e6 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -223,21 +223,22 @@ enum {
IOU_POLL_DONE = 0,
IOU_POLL_NO_ACTION = 1,
IOU_POLL_REMOVE_POLL_USE_RES = 2,
+ IOU_POLL_REISSUE = 3,
};
/*
* All poll tw should go through this. Checks for poll events, manages
* references, does rewait, etc.
*
- * Returns a negative error on failure. IOU_POLL_NO_ACTION when no action require,
- * which is either spurious wakeup or multishot CQE is served.
- * IOU_POLL_DONE when it's done with the request, then the mask is stored in req->cqe.res.
- * IOU_POLL_REMOVE_POLL_USE_RES indicates to remove multishot poll and that the result
- * is stored in req->cqe.
+ * Returns a negative error on failure. IOU_POLL_NO_ACTION when no action
+ * require, which is either spurious wakeup or multishot CQE is served.
+ * IOU_POLL_DONE when it's done with the request, then the mask is stored in
+ * req->cqe.res. IOU_POLL_REMOVE_POLL_USE_RES indicates to remove multishot
+ * poll and that the result is stored in req->cqe.
*/
static int io_poll_check_events(struct io_kiocb *req, bool *locked)
{
- int v, ret;
+ int v;
/* req->task == current here, checking PF_EXITING is safe */
if (unlikely(req->task->flags & PF_EXITING))
@@ -276,10 +277,15 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
if (!req->cqe.res) {
struct poll_table_struct pt = { ._key = req->apoll_events };
req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
+ /*
+ * We got woken with a mask, but someone else got to
+ * it first. The above vfs_poll() doesn't add us back
+ * to the waitqueue, so if we get nothing back, we
+ * should be safe and attempt a reissue.
+ */
+ if (unlikely(!req->cqe.res))
+ return IOU_POLL_REISSUE;
}
-
- if ((unlikely(!req->cqe.res)))
- continue;
if (req->apoll_events & EPOLLONESHOT)
return IOU_POLL_DONE;
@@ -294,7 +300,7 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
return IOU_POLL_REMOVE_POLL_USE_RES;
}
} else {
- ret = io_poll_issue(req, locked);
+ int ret = io_poll_issue(req, locked);
if (ret == IOU_STOP_MULTISHOT)
return IOU_POLL_REMOVE_POLL_USE_RES;
if (ret < 0)
@@ -330,6 +336,9 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
poll = io_kiocb_to_cmd(req, struct io_poll);
req->cqe.res = mangle_poll(req->cqe.res & poll->events);
+ } else if (ret == IOU_POLL_REISSUE) {
+ io_req_task_submit(req, locked);
+ return;
} else if (ret != IOU_POLL_REMOVE_POLL_USE_RES) {
req->cqe.res = ret;
req_set_fail(req);
@@ -342,7 +351,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
if (ret == IOU_POLL_REMOVE_POLL_USE_RES)
io_req_task_complete(req, locked);
- else if (ret == IOU_POLL_DONE)
+ else if (ret == IOU_POLL_DONE || ret == IOU_POLL_REISSUE)
io_req_task_submit(req, locked);
else
io_req_defer_failed(req, ret);
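The fix above is essentially a small state machine: io_poll_check_events()
reports what happened and the task-work handler dispatches on the result,
with the new IOU_POLL_REISSUE state punting the request back to submission.
A standalone sketch of that dispatch, with invented names and deliberately
simplified handling:

#include <stdio.h>

enum poll_result {
	POLL_DONE = 0,
	POLL_NO_ACTION = 1,
	POLL_REMOVE_USE_RES = 2,
	POLL_REISSUE = 3,	/* new state: re-arm by reissuing */
};

/* Stand-in for io_poll_check_events(): pretend vfs_poll() returned
 * no events even though we were woken, i.e. we lost the race. */
static enum poll_result check_events(int revents)
{
	if (!revents)
		return POLL_REISSUE;
	return POLL_DONE;
}

static void task_func(int revents)
{
	switch (check_events(revents)) {
	case POLL_NO_ACTION:
		return;			/* spurious wakeup, nothing to do */
	case POLL_REISSUE:
	case POLL_DONE:
		puts("submit request");	/* like io_req_task_submit() */
		return;
	case POLL_REMOVE_USE_RES:
		puts("complete request");
		return;
	}
}

int main(void)
{
	task_func(0);	/* racy wakeup -> reissue */
	task_func(1);	/* normal completion */
	return 0;
}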
Hi! Since kernel 6.0.18, resume from hibernate fails: the system hangs
and a hard reset is necessary. The CPU is a Ryzen 5600G; the system is
Linux From Scratch-11.1.
I found this in the system log:
[...]
Jan 12 19:30:03 LUX kernel: [ 50.248036] amdgpu 0000:30:00.0: [drm]
*ERROR* [CRTC:67:crtc-0] flip_done timed out
Jan 12 19:30:03 LUX kernel: [ 50.248040] amdgpu 0000:30:00.0: [drm]
*ERROR* [CRTC:70:crtc-1] flip_done timed out
Jan 12 19:30:14 LUX kernel: [ 60.488034] amdgpu 0000:30:00.0: [drm]
*ERROR* flip_done timed out
Jan 12 19:30:14 LUX kernel: [ 60.488040] amdgpu 0000:30:00.0: [drm]
*ERROR* [CRTC:67:crtc-0] commit wait timed out
^@^@^@^@^@^@^@^@^@^[...]@^@^@^@^@^@^@^@^@^@^@^@^@^@Jan 12 19:31:20 LUX
syslogd 1.5.1: restart.
[...]
Bisecting the problem turned up this:
~/Downloads/linux-stable-BLFS-11.1> git bisect bad
306df163069e78160e7a534b892c5cd6fefdd537 is the first bad commit
commit 306df163069e78160e7a534b892c5cd6fefdd537
Author: Alex Deucher <alexander.deucher(a)amd.com>
Date: Wed Dec 7 11:08:53 2022 -0500
drm/amdgpu: make display pinning more flexible (v2)
commit 81d0bcf9900932633d270d5bc4a54ff599c6ebdb upstream.
Only apply the static threshold for Stoney and Carrizo.
This hardware has certain requirements that don't allow
mixing of GTT and VRAM. Newer asics do not have these
requirements so we should be able to be more flexible
with where buffers end up.
[...]
Let me know if you need more info. Thanks.
Rainer Fiebig
--
The truth always turns out to be simpler than you thought.
Richard Feynman
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
deb2464e4c6d ("drm/virtio: report uuid in debugfs")
c84adb304c10 ("drm/virtio: Support virtgpu exported resources")
0a19b068acc4 ("Merge tag 'drm-misc-next-2020-06-19' of git://anongit.freedesktop.org/drm/drm-misc into drm-next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 52531258318ed59a2dc5a43df2eaf0eb1d65438e Mon Sep 17 00:00:00 2001
From: Rob Clark <robdclark(a)chromium.org>
Date: Fri, 16 Dec 2022 15:33:55 -0800
Subject: [PATCH] drm/virtio: Fix GEM handle creation UAF
Userspace can guess the handle value and try to race GEM object creation
with handle close, resulting in a use-after-free if we dereference the
object after dropping the handle's reference. For that reason, dropping
the handle's reference must be done *after* we are done dereferencing
the object.
Signed-off-by: Rob Clark <robdclark(a)chromium.org>
Reviewed-by: Chia-I Wu <olvaffe(a)gmail.com>
Fixes: 62fb7a5e1096 ("virtio-gpu: add 3d/virgl support")
Cc: stable(a)vger.kernel.org
Signed-off-by: Dmitry Osipenko <dmitry.osipenko(a)collabora.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221216233355.542197-2-robdc…
diff --git a/drivers/gpu/drm/virtio/virtgpu_ioctl.c b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
index 5d05093014ac..9f4a90493aea 100644
--- a/drivers/gpu/drm/virtio/virtgpu_ioctl.c
+++ b/drivers/gpu/drm/virtio/virtgpu_ioctl.c
@@ -358,10 +358,18 @@ static int virtio_gpu_resource_create_ioctl(struct drm_device *dev, void *data,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc->res_handle = qobj->hw_res_handle; /* similiar to a VM address */
rc->bo_handle = handle;
+
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
@@ -723,11 +731,18 @@ static int virtio_gpu_resource_create_blob_ioctl(struct drm_device *dev,
drm_gem_object_release(obj);
return ret;
}
- drm_gem_object_put(obj);
rc_blob->res_handle = bo->hw_res_handle;
rc_blob->bo_handle = handle;
+ /*
+ * The handle owns the reference now. But we must drop our
+ * remaining reference *after* we no longer need to dereference
+ * the obj. Otherwise userspace could guess the handle and
+ * race closing it from another thread.
+ */
+ drm_gem_object_put(obj);
+
return 0;
}
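The rule the fix enforces is general refcounting hygiene: once a second
reference (the userspace handle) is published, our local reference is all
that keeps the object safe to touch, so it must be dropped only after the
last dereference. A minimal userspace sketch of the ordering, with invented
names:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct obj {
	atomic_int refcount;
	int hw_res_handle;
};

static void obj_put(struct obj *o)
{
	if (atomic_fetch_sub(&o->refcount, 1) == 1)
		free(o);	/* last reference gone */
}

int main(void)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return 1;
	atomic_init(&o->refcount, 1);	/* our local reference */
	o->hw_res_handle = 42;

	/* Publish a second reference (the userspace "handle"). From
	 * here on, a concurrent close may drop it at any moment. */
	atomic_fetch_add(&o->refcount, 1);

	/* Wrong order: obj_put(o) first, then read o->hw_res_handle;
	 * that read races with a concurrent handle close (UAF).
	 * Right order: finish every dereference first ... */
	int res = o->hw_res_handle;

	/* ... and only then drop the local reference. */
	obj_put(o);

	printf("res_handle = %d\n", res);
	return 0;
}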
The patch above, 52531258318e ("drm/virtio: Fix GEM handle creation UAF"),
also fails to apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
deb2464e4c6d ("drm/virtio: report uuid in debugfs")
c84adb304c10 ("drm/virtio: Support virtgpu exported resources")
0a19b068acc4 ("Merge tag 'drm-misc-next-2020-06-19' of git://anongit.freedesktop.org/drm/drm-misc into drm-next")
thanks,
greg k-h
The patch above, 52531258318e ("drm/virtio: Fix GEM handle creation UAF"),
also fails to apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
deb2464e4c6d ("drm/virtio: report uuid in debugfs")
c84adb304c10 ("drm/virtio: Support virtgpu exported resources")
0a19b068acc4 ("Merge tag 'drm-misc-next-2020-06-19' of git://anongit.freedesktop.org/drm/drm-misc into drm-next")
thanks,
greg k-h
The patch above, 52531258318e ("drm/virtio: Fix GEM handle creation UAF"),
also fails to apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
52531258318e ("drm/virtio: Fix GEM handle creation UAF")
897b4d1acaf5 ("drm/virtio: implement blob resources: resource create blob ioctl")
16845c5d5409 ("drm/virtio: implement blob resources: implement vram object")
6076a9711dc5 ("drm/virtio: implement blob resources: probe for host visible region")
6815cfe602d0 ("drm/virtio: implement blob resources: probe for the feature.")
30172efbfb84 ("drm/virtio: blob prep: refactor getting pages and attaching backing")
thanks,
greg k-h
The patch below does not apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
ae9dcb91c606 ("net: stmmac: add aux timestamps fifo clearance wait")
f4da56529da6 ("net: stmmac: Add support for external trigger timestamping")
8532f613bc78 ("net: stmmac: introduce MSI Interrupt routines for mac, safety, RX & TX")
7e1c520c0d20 ("net: stmmac: introduce DMA interrupt status masking per traffic direction")
341f67e424e5 ("net: stmmac: Add hardware supported cross-timestamp")
76da35dc99af ("stmmac: intel: Add PSE and PCH PTP clock source selection")
b4d45aee6635 ("net: stmmac: add platform level clocks management")
5ec55823438e ("net: stmmac: add clocks management for gmac driver")
7310fe538ea5 ("stmmac: intel: add pcs-xpcs for Intel mGbE controller")
20e07e2c3cf3 ("net: stmmac: Add PCI bus info to ethtool driver query output")
7cfc4486e7ea ("stmmac: intel: Configure EHL PSE0 GbE and PSE1 GbE to 32 bits DMA addressing")
bff6f1db91e3 ("stmmac: intel: change all EHL/TGL to auto detect phy addr")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From ae9dcb91c6069e20b3b9505d79cbc89fd6e086f5 Mon Sep 17 00:00:00 2001
From: Noor Azura Ahmad Tarmizi <noor.azura.ahmad.tarmizi(a)intel.com>
Date: Wed, 11 Jan 2023 13:02:00 +0800
Subject: [PATCH] net: stmmac: add aux timestamps fifo clearance wait
Add timeout polling wait for auxiliary timestamps snapshot FIFO clear bit
(ATSFC) to clear. This is to ensure no residue fifo value is being read
erroneously.
Fixes: f4da56529da6 ("net: stmmac: Add support for external trigger timestamping")
Cc: <stable(a)vger.kernel.org> # 5.10.x
Signed-off-by: Noor Azura Ahmad Tarmizi <noor.azura.ahmad.tarmizi(a)intel.com>
Link: https://lore.kernel.org/r/20230111050200.2130-1-noor.azura.ahmad.tarmizi@in…
Signed-off-by: Jakub Kicinski <kuba(a)kernel.org>
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
index fc06ddeac0d5..b4388ca8d211 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_ptp.c
@@ -210,7 +210,10 @@ static int stmmac_enable(struct ptp_clock_info *ptp,
}
writel(acr_value, ptpaddr + PTP_ACR);
mutex_unlock(&priv->aux_ts_lock);
- ret = 0;
+ /* wait for auxts fifo clear to finish */
+ ret = readl_poll_timeout(ptpaddr + PTP_ACR, acr_value,
+ !(acr_value & PTP_ACR_ATSFC),
+ 10, 10000);
break;
default:
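readl_poll_timeout() re-reads a register until a condition holds or a
timeout expires and returns 0 or -ETIMEDOUT. A userspace approximation of
the added wait (poll every 10us, give up after 10000us), where a plain
variable and an illustrative bit value stand in for the PTP_ACR register
and the ATSFC flag:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define PTP_ACR_ATSFC	(1 << 0)	/* illustrative bit value */

static volatile unsigned int fake_ptp_acr = PTP_ACR_ATSFC;

/* Re-read the "register" until ATSFC clears: at most 10000us total,
 * sleeping 10us between reads, mirroring readl_poll_timeout(). */
static int wait_atsfc_clear(void)
{
	for (int waited = 0; waited < 10000; waited += 10) {
		if (!(fake_ptp_acr & PTP_ACR_ATSFC))
			return 0;
		usleep(10);
		fake_ptp_acr = 0;	/* pretend the hardware finished */
	}
	return -ETIMEDOUT;
}

int main(void)
{
	printf("wait_atsfc_clear() = %d\n", wait_atsfc_clear());
	return 0;
}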
The patch below does not apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
87ca4f9efbd7 ("sched/core: Fix use-after-free bug in dup_user_cpus_ptr()")
8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
713a2e21a513 ("sched: Introduce affinity_context")
d664e399128b ("sched: Fix missing prototype warnings")
e81daa7b6489 ("sched/headers: Reorganize, clean up and optimize kernel/sched/build_utility.c dependencies")
0dda4eeb4849 ("sched/headers: Reorganize, clean up and optimize kernel/sched/build_policy.c dependencies")
c4ad6fcb67c4 ("sched/headers: Reorganize, clean up and optimize kernel/sched/fair.c dependencies")
e66f6481a8c7 ("sched/headers: Reorganize, clean up and optimize kernel/sched/core.c dependencies")
b9e9c6ca6e54 ("sched/headers: Standardize kernel/sched/sched.h header dependencies")
f96eca432015 ("sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there")
801c14195510 ("sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there")
d90a2f160a1c ("sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h")
99cf983cc8bc ("sched/preempt: Add PREEMPT_DYNAMIC using static keys")
33c64734be34 ("sched/preempt: Decouple HAVE_PREEMPT_DYNAMIC from GENERIC_ENTRY")
4624a14f4daa ("sched/preempt: Simplify irqentry_exit_cond_resched() callers")
8a69fe0be143 ("sched/preempt: Refactor sched_dynamic_update()")
4c7485584d48 ("sched/preempt: Move PREEMPT_DYNAMIC logic later")
6ae71436cda7 ("Merge tag 'sched_core_for_v5.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 87ca4f9efbd7cc649ff43b87970888f2812945b8 Mon Sep 17 00:00:00 2001
From: Waiman Long <longman(a)redhat.com>
Date: Fri, 30 Dec 2022 23:11:19 -0500
Subject: [PATCH] sched/core: Fix use-after-free bug in dup_user_cpus_ptr()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Since commit 07ec77a1d4e8 ("sched: Allow task CPU affinity to be
restricted on asymmetric systems"), the setting and clearing of
user_cpus_ptr are done under pi_lock for arm64 architecture. However,
dup_user_cpus_ptr() accesses user_cpus_ptr without any lock
protection. Since sched_setaffinity() can be invoked from another
process, the process being modified may be undergoing fork() at
the same time. When racing with the clearing of user_cpus_ptr in
__set_cpus_allowed_ptr_locked(), it can lead to a use-after-free and
possibly a double-free in the arm64 kernel.
Commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask") fixes this problem as user_cpus_ptr, once set, will never
be cleared in a task's lifetime. However, this bug was re-introduced
in commit 851a723e45d1 ("sched: Always clear user_cpus_ptr in
do_set_cpus_allowed()") which allows the clearing of user_cpus_ptr in
do_set_cpus_allowed(). This time, it will affect all arches.
Fix this bug by always clearing the user_cpus_ptr of the newly
cloned/forked task before the copying process starts and check the
user_cpus_ptr state of the source task under pi_lock.
Note to stable, this patch won't be applicable to stable releases.
Just copy the new dup_user_cpus_ptr() function over.
Fixes: 07ec77a1d4e8 ("sched: Allow task CPU affinity to be restricted on asymmetric systems")
Fixes: 851a723e45d1 ("sched: Always clear user_cpus_ptr in do_set_cpus_allowed()")
Reported-by: David Wang 王标 <wangbiao3(a)xiaomi.com>
Signed-off-by: Waiman Long <longman(a)redhat.com>
Signed-off-by: Ingo Molnar <mingo(a)kernel.org>
Reviewed-by: Peter Zijlstra <peterz(a)infradead.org>
Cc: stable(a)vger.kernel.org
Link: https://lore.kernel.org/r/20221231041120.440785-2-longman@redhat.com
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 965d813c28ad..f9f6e5413dcf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2612,19 +2612,43 @@ void do_set_cpus_allowed(struct task_struct *p, const struct cpumask *new_mask)
int dup_user_cpus_ptr(struct task_struct *dst, struct task_struct *src,
int node)
{
+ cpumask_t *user_mask;
unsigned long flags;
- if (!src->user_cpus_ptr)
+ /*
+ * Always clear dst->user_cpus_ptr first as their user_cpus_ptr's
+ * may differ by now due to racing.
+ */
+ dst->user_cpus_ptr = NULL;
+
+ /*
+ * This check is racy and losing the race is a valid situation.
+ * It is not worth the extra overhead of taking the pi_lock on
+ * every fork/clone.
+ */
+ if (data_race(!src->user_cpus_ptr))
return 0;
- dst->user_cpus_ptr = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
- if (!dst->user_cpus_ptr)
+ user_mask = kmalloc_node(cpumask_size(), GFP_KERNEL, node);
+ if (!user_mask)
return -ENOMEM;
- /* Use pi_lock to protect content of user_cpus_ptr */
+ /*
+ * Use pi_lock to protect content of user_cpus_ptr
+ *
+ * Though unlikely, user_cpus_ptr can be reset to NULL by a concurrent
+ * do_set_cpus_allowed().
+ */
raw_spin_lock_irqsave(&src->pi_lock, flags);
- cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ if (src->user_cpus_ptr) {
+ swap(dst->user_cpus_ptr, user_mask);
+ cpumask_copy(dst->user_cpus_ptr, src->user_cpus_ptr);
+ }
raw_spin_unlock_irqrestore(&src->pi_lock, flags);
+
+ if (unlikely(user_mask))
+ kfree(user_mask);
+
return 0;
}
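The fix follows a classic lock-avoidance pattern: do the racy pre-check and
the allocation outside the lock, re-check and commit under the lock, then
free whatever was not consumed. A userspace sketch of that pattern, with a
pthread mutex standing in for pi_lock and all names illustrative:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;

struct task {
	char *user_mask;	/* may be reset to NULL concurrently */
};

static int dup_mask(struct task *dst, struct task *src, size_t size)
{
	char *user_mask;

	dst->user_mask = NULL;		/* clear first; may differ by now */

	/* Racy pre-check: avoid the allocation in the common case. */
	if (!src->user_mask)
		return 0;

	user_mask = malloc(size);	/* allocate outside the lock */
	if (!user_mask)
		return -1;

	pthread_mutex_lock(&pi_lock);
	if (src->user_mask) {		/* re-check under the lock */
		dst->user_mask = user_mask;
		user_mask = NULL;	/* consumed, like the swap() */
		memcpy(dst->user_mask, src->user_mask, size);
	}
	pthread_mutex_unlock(&pi_lock);

	free(user_mask);		/* no-op if it was consumed */
	return 0;
}

int main(void)
{
	struct task src = { .user_mask = strdup("0-3") };
	struct task dst = { 0 };

	dup_mask(&dst, &src, 4);
	printf("dst mask: %s\n", dst.user_mask ? dst.user_mask : "(none)");
	return 0;
}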
The patch above, 544d163d659d ("io_uring: lock overflowing for IOPOLL"),
also fails to apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
544d163d659d ("io_uring: lock overflowing for IOPOLL")
a8cf95f93610 ("io_uring: fix overflow handling regression")
fa18fa2272c7 ("io_uring: inline __io_req_complete_put()")
f9d567c75ec2 ("io_uring: inline __io_req_complete_post()")
52120f0fadcb ("io_uring: add allow_overflow to io_post_aux_cqe")
e6130eba8a84 ("io_uring: add support for passing fixed file descriptors")
253993210bd8 ("io_uring: introduce locking helpers for CQE posting")
305bef988708 ("io_uring: hide eventfd assumptions in eventfd paths")
d9dee4302a7c ("io_uring: remove ->flush_cqes optimisation")
a830ffd28780 ("io_uring: move io_eventfd_signal()")
9046c6415be6 ("io_uring: reshuffle io_uring/io_uring.h")
d142c3ec8d16 ("io_uring: remove extra io_commit_cqring()")
68494a65d0e2 ("io_uring: introduce io_req_cqe_overflow()")
faf88dde060f ("io_uring: don't inline __io_get_cqe()")
d245bca6375b ("io_uring: don't expose io_fill_cqe_aux()")
aa1e90f64ee5 ("io_uring: move small helpers to headers")
thanks,
greg k-h
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
d391c5827107 ("drivers/firmware: move x86 Generic System Framebuffers support")
25519d683442 ("ima: generalize x86/EFI arch glue for other EFI architectures")
88e9a5b7965c ("efi/fake_mem: arrange for a resource entry per efi_fake_mem instance")
4d0a4388ccdd ("Merge branch 'efi/urgent' into efi/core, to pick up fixes")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 703c13fe3c9af557d312f5895ed6a5fda2711104 Mon Sep 17 00:00:00 2001
From: Johan Hovold <johan+linaro(a)kernel.org>
Date: Mon, 19 Dec 2022 10:10:04 +0100
Subject: [PATCH] efi: fix NULL-deref in init error path
In cases where runtime services are not supported or have been disabled,
the runtime services workqueue will never have been allocated.
Do not try to destroy the workqueue unconditionally in the unlikely
event that EFI initialisation fails to avoid dereferencing a NULL
pointer.
Fixes: 98086df8b70c ("efi: add missed destroy_workqueue when efisubsys_init fails")
Cc: stable(a)vger.kernel.org
Cc: Li Heng <liheng40(a)huawei.com>
Signed-off-by: Johan Hovold <johan+linaro(a)kernel.org>
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 09716eebe8ac..a2b0cbc8741c 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -394,8 +394,8 @@ static int __init efisubsys_init(void)
efi_kobj = kobject_create_and_add("efi", firmware_kobj);
if (!efi_kobj) {
pr_err("efi: Firmware registration failed.\n");
- destroy_workqueue(efi_rts_wq);
- return -ENOMEM;
+ error = -ENOMEM;
+ goto err_destroy_wq;
}
if (efi_rt_services_supported(EFI_RT_SUPPORTED_GET_VARIABLE |
@@ -443,7 +443,10 @@ static int __init efisubsys_init(void)
err_put:
kobject_put(efi_kobj);
efi_kobj = NULL;
- destroy_workqueue(efi_rts_wq);
+err_destroy_wq:
+ if (efi_rts_wq)
+ destroy_workqueue(efi_rts_wq);
+
return error;
}
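The fix is the usual goto-unwind idiom plus a NULL guard for a resource
that may never have been created. A compact userspace sketch of the shape,
with malloc()/free() and an environment variable standing in for the real
resources and failure points:

#include <stdio.h>
#include <stdlib.h>

static char *rts_wq;		/* NULL when runtime services are off */
static int rts_supported;

static int subsys_init(void)
{
	int error = 0;
	char *kobj;

	if (rts_supported) {
		rts_wq = malloc(16);
		if (!rts_wq)
			return -1;
	}

	kobj = malloc(16);	/* stands in for kobject_create_and_add() */
	if (!kobj) {
		error = -1;
		goto err_destroy_wq;
	}

	/* A later step that can fail and must unwind everything;
	 * the environment variable stands in for a real failure. */
	if (getenv("FAIL_LATE")) {
		error = -1;
		goto err_put;
	}

	return 0;	/* on success both resources stay live */

err_put:
	free(kobj);
err_destroy_wq:
	if (rts_wq)	/* may legitimately be NULL: guard the teardown */
		free(rts_wq);
	return error;
}

int main(void)
{
	printf("subsys_init() = %d\n", subsys_init());
	return 0;
}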
The patch above, 703c13fe3c9a ("efi: fix NULL-deref in init error path"),
also fails to apply to the 5.4-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
d391c5827107 ("drivers/firmware: move x86 Generic System Framebuffers support")
25519d683442 ("ima: generalize x86/EFI arch glue for other EFI architectures")
88e9a5b7965c ("efi/fake_mem: arrange for a resource entry per efi_fake_mem instance")
4d0a4388ccdd ("Merge branch 'efi/urgent' into efi/core, to pick up fixes")
thanks,
greg k-h
The patch above, 703c13fe3c9a ("efi: fix NULL-deref in init error path"),
also fails to apply to the 5.10-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
d391c5827107 ("drivers/firmware: move x86 Generic System Framebuffers support")
25519d683442 ("ima: generalize x86/EFI arch glue for other EFI architectures")
thanks,
greg k-h
The patch above, 703c13fe3c9a ("efi: fix NULL-deref in init error path"),
also fails to apply to the 5.15-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
thanks,
greg k-h
The patch above, 703c13fe3c9a ("efi: fix NULL-deref in init error path"),
also fails to apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
703c13fe3c9a ("efi: fix NULL-deref in init error path")
1fff234de2b6 ("efi: x86: Move EFI runtime map sysfs code to arch/x86")
8dfac4d8ad27 ("efi: runtime-maps: Clarify purpose and enable by default for kexec")
4059ba656ce5 ("efi: memmap: Move EFI fake memmap support into x86 arch tree")
thanks,
greg k-h
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
d3f450533bbc ("efi: tpm: Avoid READ_ONCE() for accessing the event log")
047d50aee341 ("efi/tpm: Don't access event->count when it isn't mapped")
c46f3405692d ("tpm: Reserve the TPM final events table")
44038bc514a2 ("tpm: Abstract crypto agile event size calculations")
c8faabfc6f48 ("tpm: add _head suffix to tcg_efi_specid_event and tcg_pcr_event2")
3425d934fc03 ("efi/x86: Handle page faults occurring while running EFI runtime services")
9dbbedaa6171 ("efi: Make efi_rts_work accessible to efi page fault handler")
71e0940d52e1 ("efi: honour memory reservations passed via a linux specific config table")
50a7ca3c6fc8 ("mm: convert return type of handle_mm_fault() caller to vm_fault_t")
3eb420e70d87 ("efi: Use a work queue to invoke EFI Runtime Services")
c90fca951e90 ("Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From d3f450533bbcb6dd4d7d59cadc9b61b7321e4ac1 Mon Sep 17 00:00:00 2001
From: Ard Biesheuvel <ardb(a)kernel.org>
Date: Mon, 9 Jan 2023 10:44:31 +0100
Subject: [PATCH] efi: tpm: Avoid READ_ONCE() for accessing the event log
Nathan reports that recent kernels built with LTO will crash when doing
EFI boot using Fedora's GRUB and SHIM. The culprit turns out to be a
misaligned load from the TPM event log, which is annotated with
READ_ONCE(), and under LTO, this gets translated into a LDAR instruction
which does not tolerate misaligned accesses.
Interestingly, this does not happen when booting the same kernel
straight from the UEFI shell, and so the fact that the event log may
appear misaligned in memory may be caused by a bug in GRUB or SHIM.
However, using READ_ONCE() to access firmware tables is slightly unusual
in any case, and here, we only need to ensure that 'event' is not
dereferenced again after it gets unmapped, but this is already taken
care of by the implicit barrier() semantics of the early_memunmap()
call.
Cc: <stable(a)vger.kernel.org>
Cc: Peter Jones <pjones(a)redhat.com>
Cc: Jarkko Sakkinen <jarkko(a)kernel.org>
Cc: Matthew Garrett <mjg59(a)srcf.ucam.org>
Reported-by: Nathan Chancellor <nathan(a)kernel.org>
Tested-by: Nathan Chancellor <nathan(a)kernel.org>
Link: https://github.com/ClangBuiltLinux/linux/issues/1782
Signed-off-by: Ard Biesheuvel <ardb(a)kernel.org>
diff --git a/include/linux/tpm_eventlog.h b/include/linux/tpm_eventlog.h
index 20c0ff54b7a0..7d68a5cc5881 100644
--- a/include/linux/tpm_eventlog.h
+++ b/include/linux/tpm_eventlog.h
@@ -198,8 +198,8 @@ static __always_inline int __calc_tpm2_event_size(struct tcg_pcr_event2_head *ev
* The loop below will unmap these fields if the log is larger than
* one page, so save them here for reference:
*/
- count = READ_ONCE(event->count);
- event_type = READ_ONCE(event->event_type);
+ count = event->count;
+ event_type = event->event_type;
/* Verify that it's the log header */
if (event_header->pcr_idx != 0 ||
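The underlying rule is simple: copy any fields you still need out of a
mapping before tearing the mapping down; the implicit barrier in
early_memunmap() already keeps the compiler from reordering the plain loads
past it. A userspace sketch of the save-before-unmap pattern, using
mmap()/munmap() as a stand-in for the early mapping:

#include <stdio.h>
#include <sys/mman.h>

struct event_head {
	unsigned int count;
	unsigned int event_type;
};

int main(void)
{
	/* Stand-in for the early_memremap() of the event log header. */
	void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct event_head *event;

	if (map == MAP_FAILED)
		return 1;
	event = map;
	event->count = 3;
	event->event_type = 1;

	/* Plain loads suffice; just save the fields we still need
	 * before the mapping goes away. */
	unsigned int count = event->count;
	unsigned int event_type = event->event_type;

	munmap(map, 4096);	/* 'event' must not be touched past here */

	printf("count=%u type=%u\n", count, event_type);
	return 0;
}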
The patch above, d3f450533bbc ("efi: tpm: Avoid READ_ONCE() for accessing
the event log"), also fails to apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
d3f450533bbc ("efi: tpm: Avoid READ_ONCE() for accessing the event log")
047d50aee341 ("efi/tpm: Don't access event->count when it isn't mapped")
c46f3405692d ("tpm: Reserve the TPM final events table")
44038bc514a2 ("tpm: Abstract crypto agile event size calculations")
c8faabfc6f48 ("tpm: add _head suffix to tcg_efi_specid_event and tcg_pcr_event2")
3425d934fc03 ("efi/x86: Handle page faults occurring while running EFI runtime services")
9dbbedaa6171 ("efi: Make efi_rts_work accessible to efi page fault handler")
71e0940d52e1 ("efi: honour memory reservations passed via a linux specific config table")
thanks,
greg k-h
Hi,
On an arm machine we hit a compilation error while building
tools/testing/selftests/kvm/ with the 5.15.y LTS kernel:
rseq_test.c: In function ‘main’:
rseq_test.c:236:33: warning: implicit declaration of function ‘gettid’;
did you mean ‘getgid’? [-Wimplicit-function-declaration]
(void *)(unsigned long)gettid());
^~~~~~
getgid
/tmp/cc4Wfmv2.o: In function `main':
/home/test/linux/tools/testing/selftests/kvm/rseq_test.c:236: undefined
reference to `gettid'
collect2: error: ld returned 1 exit status
make: *** [../lib.mk:151:
/home/test/linux/tools/testing/selftests/kvm/rseq_test] Error 1
make: Leaving directory '/home/test/linux/tools/testing/selftests/kvm'
This is fixed by commit 561cafebb2cf97b0927b4fb0eba22de6200f682e upstream.
Kernel version to apply: 5.15.y LTS
Could we please backport the above fix onto 5.15.y LTS? It applies
cleanly, and the kvm selftests compile properly after applying the patch.
Liam pointed the same in [1]
Thanks,
Harshit
[1]
https://lore.kernel.org/all/fdfb143a-45c4-aaff-aa95-d20c076ff555@oracle.com/
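For background, glibc only gained a gettid() wrapper in version 2.30; on
older toolchains the usual workaround is to invoke the raw syscall, which
is the approach such selftest fixes typically take. A minimal sketch:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thread-id helper for pre-2.30 glibc, which lacks gettid(). */
static pid_t sys_gettid(void)
{
	return (pid_t)syscall(SYS_gettid);
}

int main(void)
{
	printf("tid = %d\n", (int)sys_gettid());
	return 0;
}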
From: Denis Nikitin <denik(a)chromium.org>
commit bde971a83bbff78561458ded236605a365411b87 upstream.
Kernel build with clang and KCFLAGS=-fprofile-sample-use=<profile> fails with:
error: arch/arm64/kvm/hyp/nvhe/kvm_nvhe.tmp.o: Unexpected SHT_REL
section ".rel.llvm.call-graph-profile"
Starting from 13.0.0 llvm can generate SHT_REL section, see
https://reviews.llvm.org/rGca3bdb57fa1ac98b711a735de048c12b5fdd8086.
gen-hyprel does not support SHT_REL relocation section.
Filter out profile use flags to fix the build with profile optimization.
Signed-off-by: Denis Nikitin <denik(a)chromium.org>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Link: https://lore.kernel.org/r/20221014184532.3153551-1-denik@chromium.org
Signed-off-by: Stephen Boyd <swboyd(a)chromium.org>
---
I need this to properly compile 5.15.y stable kernels in the chromeos build
system.
arch/arm64/kvm/hyp/nvhe/Makefile | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
index 8d741f71377f..964c2134ea1e 100644
--- a/arch/arm64/kvm/hyp/nvhe/Makefile
+++ b/arch/arm64/kvm/hyp/nvhe/Makefile
@@ -83,6 +83,10 @@ quiet_cmd_hypcopy = HYPCOPY $@
# Remove ftrace, Shadow Call Stack, and CFI CFLAGS.
# This is equivalent to the 'notrace', '__noscs', and '__nocfi' annotations.
KBUILD_CFLAGS := $(filter-out $(CC_FLAGS_FTRACE) $(CC_FLAGS_SCS) $(CC_FLAGS_CFI), $(KBUILD_CFLAGS))
+# Starting from 13.0.0 llvm emits SHT_REL section '.llvm.call-graph-profile'
+# when profile optimization is applied. gen-hyprel does not support SHT_REL and
+# causes a build failure. Remove profile optimization flags.
+KBUILD_CFLAGS := $(filter-out -fprofile-sample-use=% -fprofile-use=%, $(KBUILD_CFLAGS))
# KVM nVHE code is run at a different exception code with a different map, so
# compiler instrumentation that inserts callbacks or checks into the code may
--
https://chromeos.dev
The patch below does not apply to the 4.19-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
406504c7b040 ("KVM: arm64: Fix S1PTW handling on RO memslots")
c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch")
16314874b12b ("Merge branch 'kvm-arm64/misc-5.9' into kvmarm-master/next")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 406504c7b0405d74d74c15a667cd4c4620c3e7a9 Mon Sep 17 00:00:00 2001
From: Marc Zyngier <maz(a)kernel.org>
Date: Tue, 20 Dec 2022 14:03:52 +0000
Subject: [PATCH] KVM: arm64: Fix S1PTW handling on RO memslots
A recent development on the EFI front has resulted in guests having
their page tables baked in the firmware binary, and mapped into the
IPA space as part of a read-only memslot. Not only is this legitimate,
but it also results in added security, so thumbs up.
It is possible to take an S1PTW translation fault if the S1 PTs are
unmapped at stage-2. However, KVM unconditionally treats S1PTW as a
write to correctly handle hardware AF/DB updates to the S1 PTs.
Furthermore, KVM injects an exception into the guest for S1PTW writes.
In the aforementioned case this results in the guest taking an abort
it won't recover from, as the S1 PTs mapping the vectors suffer from
the same problem.
So clearly our handling is... wrong.
Instead, switch to a two-pronged approach:
- On S1PTW translation fault, handle the fault as a read
- On S1PTW permission fault, handle the fault as a write
This is of no consequence to SW that *writes* to its PTs (the write
will trigger a non-S1PTW fault), and SW that uses RO PTs will not
use HW-assisted AF/DB anyway, as that'd be wrong.
Only in the case described in c4ad98e4b72c ("KVM: arm64: Assume write
fault on S1PTW permission fault on instruction fetch") do we end-up
with two back-to-back faults (page being evicted and faulted back).
I don't think this is a case worth optimising for.
Fixes: c4ad98e4b72c ("KVM: arm64: Assume write fault on S1PTW permission fault on instruction fetch")
Reviewed-by: Oliver Upton <oliver.upton(a)linux.dev>
Reviewed-by: Ard Biesheuvel <ardb(a)kernel.org>
Regression-tested-by: Ard Biesheuvel <ardb(a)kernel.org>
Signed-off-by: Marc Zyngier <maz(a)kernel.org>
Cc: stable(a)vger.kernel.org
diff --git a/arch/arm64/include/asm/kvm_emulate.h b/arch/arm64/include/asm/kvm_emulate.h
index 9bdba47f7e14..0d40c48d8132 100644
--- a/arch/arm64/include/asm/kvm_emulate.h
+++ b/arch/arm64/include/asm/kvm_emulate.h
@@ -373,8 +373,26 @@ static __always_inline int kvm_vcpu_sys_get_rt(struct kvm_vcpu *vcpu)
static inline bool kvm_is_write_fault(struct kvm_vcpu *vcpu)
{
- if (kvm_vcpu_abt_iss1tw(vcpu))
- return true;
+ if (kvm_vcpu_abt_iss1tw(vcpu)) {
+ /*
+ * Only a permission fault on a S1PTW should be
+ * considered as a write. Otherwise, page tables baked
+ * in a read-only memslot will result in an exception
+ * being delivered in the guest.
+ *
+ * The drawback is that we end-up faulting twice if the
+ * guest is using any of HW AF/DB: a translation fault
+ * to map the page containing the PT (read only at
+ * first), then a permission fault to allow the flags
+ * to be set.
+ */
+ switch (kvm_vcpu_trap_get_fault_type(vcpu)) {
+ case ESR_ELx_FSC_PERM:
+ return true;
+ default:
+ return false;
+ }
+ }
if (kvm_vcpu_trap_is_iabt(vcpu))
return false;
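The two-pronged rule reduces to a small classifier: an S1PTW fault counts
as a write only when it is a permission fault. A standalone sketch of that
logic; the FSC constants below are illustrative stand-ins for the ESR_ELx
fault status codes:

#include <stdio.h>

enum fsc { FSC_FAULT = 0x04, FSC_PERM = 0x0c };	/* illustrative values */

/* Mirror of the fixed logic: S1PTW permission faults are writes,
 * S1PTW translation faults are reads. */
static int is_write_fault(int s1ptw, enum fsc fault_type)
{
	if (s1ptw)
		return fault_type == FSC_PERM;
	return 0;	/* the non-S1PTW cases are elided here */
}

int main(void)
{
	printf("S1PTW translation fault -> write? %d\n",
	       is_write_fault(1, FSC_FAULT));
	printf("S1PTW permission fault  -> write? %d\n",
	       is_write_fault(1, FSC_PERM));
	return 0;
}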
The patch below does not apply to the 4.14-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
45e966fcca03 ("KVM: x86: Do not return host topology information from KVM_GET_SUPPORTED_CPUID")
cde363ab7ca7 ("Documentation: KVM: add API issues section")
ba7bb663f554 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
4dfc4ec2b7f5 ("Merge branch 'kvm-ppc-cap-210' into kvm-next-5.18")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 45e966fcca03ecdcccac7cb236e16eea38cc18af Mon Sep 17 00:00:00 2001
From: Paolo Bonzini <pbonzini(a)redhat.com>
Date: Sat, 22 Oct 2022 04:17:53 -0400
Subject: [PATCH] KVM: x86: Do not return host topology information from
KVM_GET_SUPPORTED_CPUID
Passing the host topology to the guest is almost certainly wrong
and will confuse the scheduler. In addition, several fields of
these CPUID leaves vary on each processor; it is simply impossible to
return the right values from KVM_GET_SUPPORTED_CPUID in such a way that
they can be passed to KVM_SET_CPUID2.
The values that will most likely prevent confusion are all zeroes.
Userspace will have to override it anyway if it wishes to present a
specific topology to the guest.
Cc: stable(a)vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini(a)redhat.com>
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index deb494f759ed..d8ea37dfddf4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8310,6 +8310,20 @@ CPU[EAX=1]:ECX[24] (TSC_DEADLINE) is not reported by ``KVM_GET_SUPPORTED_CPUID``
It can be enabled if ``KVM_CAP_TSC_DEADLINE_TIMER`` is present and the kernel
has enabled in-kernel emulation of the local APIC.
+CPU topology
+~~~~~~~~~~~~
+
+Several CPUID values include topology information for the host CPU:
+0x0b and 0x1f for Intel systems, 0x8000001e for AMD systems. Different
+versions of KVM return different values for this information and userspace
+should not rely on it. Currently they return all zeroes.
+
+If userspace wishes to set up a guest topology, it should be careful that
+the values of these three leaves differ for each CPU. In particular,
+the APIC ID is found in EDX for all subleaves of 0x0b and 0x1f, and in EAX
+for 0x8000001e; the latter also encodes the core id and node id in bits
+7:0 of EBX and ECX respectively.
+
Obsolete ioctls and capabilities
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index b14653b61470..596061c1610e 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -770,16 +770,22 @@ struct kvm_cpuid_array {
int nent;
};
+static struct kvm_cpuid_entry2 *get_next_cpuid(struct kvm_cpuid_array *array)
+{
+ if (array->nent >= array->maxnent)
+ return NULL;
+
+ return &array->entries[array->nent++];
+}
+
static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
u32 function, u32 index)
{
- struct kvm_cpuid_entry2 *entry;
+ struct kvm_cpuid_entry2 *entry = get_next_cpuid(array);
- if (array->nent >= array->maxnent)
+ if (!entry)
return NULL;
- entry = &array->entries[array->nent++];
-
memset(entry, 0, sizeof(*entry));
entry->function = function;
entry->index = index;
@@ -956,22 +962,13 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->edx = edx.full;
break;
}
- /*
- * Per Intel's SDM, the 0x1f is a superset of 0xb,
- * thus they can be handled by common code.
- */
case 0x1f:
case 0xb:
/*
- * Populate entries until the level type (ECX[15:8]) of the
- * previous entry is zero. Note, CPUID EAX.{0x1f,0xb}.0 is
- * the starting entry, filled by the primary do_host_cpuid().
+ * No topology; a valid topology is indicated by the presence
+ * of subleaf 1.
*/
- for (i = 1; entry->ecx & 0xff00; ++i) {
- entry = do_host_cpuid(array, function, i);
- if (!entry)
- goto out;
- }
+ entry->eax = entry->ebx = entry->ecx = 0;
break;
case 0xd: {
u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
@@ -1202,6 +1199,9 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
entry->ebx = entry->ecx = entry->edx = 0;
break;
case 0x8000001e:
+ /* Do not return host topology information. */
+ entry->eax = entry->ebx = entry->ecx = 0;
+ entry->edx = 0; /* reserved */
break;
case 0x8000001F:
if (!kvm_cpu_cap_has(X86_FEATURE_SEV)) {
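To make the documented layout above concrete, here is a minimal C
sketch of the userspace override the commit message calls for: it
patches the three topology leaves of a table obtained from
KVM_GET_SUPPORTED_CPUID before the table is handed to KVM_SET_CPUID2.
The apic_id numbering and the core/node encoding below (two threads per
core, a single node) are assumptions made for the example, not anything
mandated by the KVM API.

/* Minimal sketch: patch the topology leaves of a CPUID table from
 * KVM_GET_SUPPORTED_CPUID for one vCPU. The numbering below (core id
 * apic_id/2, i.e. two threads per core, node id 0) is an assumption
 * for illustration only. */
#include <linux/kvm.h>

static void set_guest_topology(struct kvm_cpuid2 *cpuid, __u32 apic_id)
{
	__u32 i;

	for (i = 0; i < cpuid->nent; i++) {
		struct kvm_cpuid_entry2 *e = &cpuid->entries[i];

		switch (e->function) {
		case 0x0b:
		case 0x1f:
			/* APIC ID lives in EDX for every subleaf. */
			e->edx = apic_id;
			break;
		case 0x8000001e:
			/* APIC ID in EAX; core id in EBX[7:0] and
			 * node id in ECX[7:0]. */
			e->eax = apic_id;
			e->ebx = (e->ebx & ~0xffU) | (apic_id >> 1);
			e->ecx = (e->ecx & ~0xffU);
			break;
		}
	}
}

Calling this once per vCPU with a distinct apic_id, before
ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid), keeps the values of the three
leaves distinct for each CPU, as the documentation above requires.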
The patch below does not apply to the 6.1-stable tree.
If someone wants it applied there, or to any other stable or longterm
tree, then please email the backport, including the original git commit
id to <stable(a)vger.kernel.org>.
Possible dependencies:
16f1f838442d ("Revert "ALSA: usb-audio: Drop superfluous interface setup at parsing"")
thanks,
greg k-h
------------------ original commit in Linus's tree ------------------
From 16f1f838442dc6430d32d51ddda347b8421ec34b Mon Sep 17 00:00:00 2001
From: Takashi Iwai <tiwai(a)suse.de>
Date: Wed, 4 Jan 2023 16:09:44 +0100
Subject: [PATCH] Revert "ALSA: usb-audio: Drop superfluous interface setup at
parsing"
This reverts commit ac5e2fb425e1121ceef2b9d1b3ffccc195d55707.
The commit caused a regression on Behringer UMC404HD (and likely
others). As the change was meant only as a minor optimization, it's
better to revert it to address the regression.
Reported-and-tested-by: Michael Ralston <michael(a)ralston.id.au>
Cc: <stable(a)vger.kernel.org>
Link: https://lore.kernel.org/r/CAC2975JXkS1A5Tj9b02G_sy25ZWN-ys+tc9wmkoS=qPgKCog…
Link: https://lore.kernel.org/r/20230104150944.24918-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai(a)suse.de>
diff --git a/sound/usb/stream.c b/sound/usb/stream.c
index f75601ca2d52..f10f4e6d3fb8 100644
--- a/sound/usb/stream.c
+++ b/sound/usb/stream.c
@@ -1222,6 +1222,12 @@ static int __snd_usb_parse_audio_interface(struct snd_usb_audio *chip,
if (err < 0)
return err;
}
+
+ /* try to set the interface... */
+ usb_set_interface(chip->dev, iface_no, 0);
+ snd_usb_init_pitch(chip, fp);
+ snd_usb_init_sample_rate(chip, fp, fp->rate_max);
+ usb_set_interface(chip->dev, iface_no, altno);
}
return 0;
}
I'm announcing the release of the 6.1.6 kernel.
All users of the 6.1 kernel series must upgrade.
The updated 6.1.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-6.1.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/parisc/include/uapi/asm/mman.h | 29 ++---
arch/parisc/kernel/sys_parisc.c | 28 +++++
arch/parisc/kernel/syscalls/syscall.tbl | 2
arch/x86/kernel/fpu/core.c | 19 +--
arch/x86/kernel/fpu/regset.c | 2
arch/x86/kernel/fpu/signal.c | 2
arch/x86/kernel/fpu/xstate.c | 52 +++++++++-
arch/x86/kernel/fpu/xstate.h | 4
fs/nfsd/nfs4proc.c | 7 -
fs/nfsd/nfs4xdr.c | 2
init/Kconfig | 6 +
net/sched/sch_api.c | 5 +
net/sunrpc/auth_gss/svcauth_gss.c | 4
net/sunrpc/svc.c | 6 -
net/sunrpc/svc_xprt.c | 2
net/sunrpc/svcsock.c | 8 -
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2
sound/core/control.c | 24 +++-
sound/pci/hda/cs35l41_hda.c | 20 +++-
sound/pci/hda/patch_hdmi.c | 1
sound/pci/hda/patch_realtek.c | 2
tools/arch/parisc/include/uapi/asm/mman.h | 12 +-
tools/perf/bench/bench.h | 12 --
tools/testing/selftests/vm/pkey-x86.h | 12 ++
tools/testing/selftests/vm/protection_keys.c | 131 ++++++++++++++++++++++++++-
26 files changed, 308 insertions(+), 88 deletions(-)
Adrian Chan (1):
ALSA: hda/hdmi: Add a HP device 0x8715 to force connect list
Chris Chiu (1):
ALSA: hda - Enable headset mic on another Dell laptop with ALC3254
Chuck Lever (1):
Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"
Clement Lecigne (1):
ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF
Frederick Lawler (1):
net: sched: disallow noqueue for qdisc classes
Greg Kroah-Hartman (1):
Linux 6.1.6
Helge Deller (1):
parisc: Align parisc MADV_XXX constants with all other architectures
Jeremy Szu (1):
ALSA: hda/realtek: fix mute/micmute LEDs don't work for a HP platform
Kyle Huey (6):
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
x86/fpu: Allow PKRU to be (once again) written by ptrace.
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
Linus Torvalds (1):
gcc: disable -Warray-bounds for gcc-11 too
Takashi Iwai (2):
ALSA: hda: cs35l41: Don't return -EINVAL from system suspend/resume
ALSA: hda: cs35l41: Check runtime suspend capability at runtime_idle
I'm announcing the release of the 5.15.88 kernel.
All users of the 5.15 kernel series must upgrade.
The updated 5.15.y git tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-5.15.y
and can be browsed at the normal kernel.org git web browser:
https://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=summary
thanks,
greg k-h
------------
Makefile | 2
arch/parisc/include/uapi/asm/mman.h | 27 ++---
arch/parisc/kernel/sys_parisc.c | 27 +++++
arch/parisc/kernel/syscalls/syscall.tbl | 2
arch/x86/include/asm/fpu/xstate.h | 4
arch/x86/kernel/fpu/regset.c | 2
arch/x86/kernel/fpu/signal.c | 2
arch/x86/kernel/fpu/xstate.c | 41 +++++++-
drivers/tty/serial/fsl_lpuart.c | 2
drivers/tty/serial/serial_core.c | 3
net/ipv4/inet_connection_sock.c | 16 +++
net/ipv4/tcp_ulp.c | 4
net/sched/sch_api.c | 5 +
sound/core/control.c | 24 +++-
sound/pci/hda/patch_hdmi.c | 1
sound/pci/hda/patch_realtek.c | 1
tools/arch/parisc/include/uapi/asm/mman.h | 12 +-
tools/perf/bench/bench.h | 12 --
tools/testing/selftests/vm/pkey-x86.h | 12 ++
tools/testing/selftests/vm/protection_keys.c | 131 ++++++++++++++++++++++++++-
20 files changed, 273 insertions(+), 57 deletions(-)
Adrian Chan (1):
ALSA: hda/hdmi: Add a HP device 0x8715 to force connect list
Chris Chiu (1):
ALSA: hda - Enable headset mic on another Dell laptop with ALC3254
Clement Lecigne (1):
ALSA: pcm: Move rwsem lock inside snd_ctl_elem_read to prevent UAF
Frederick Lawler (1):
net: sched: disallow noqueue for qdisc classes
Greg Kroah-Hartman (1):
Linux 5.15.88
Helge Deller (1):
parisc: Align parisc MADV_XXX constants with all other architectures
Kyle Huey (6):
x86/fpu: Take task_struct* in copy_sigframe_from_user_to_xstate()
x86/fpu: Add a pkru argument to copy_uabi_from_kernel_to_xstate().
x86/fpu: Add a pkru argument to copy_uabi_to_xstate()
x86/fpu: Allow PKRU to be (once again) written by ptrace.
x86/fpu: Emulate XRSTOR's behavior if the xfeatures PKRU bit is not set
selftests/vm/pkeys: Add a regression test for setting PKRU through ptrace
Paolo Abeni (1):
net/ulp: prevent ULP without clone op from entering the LISTEN status
Rasmus Villemoes (1):
serial: fixup backport of "serial: Deassert Transmit Enable on probe in driver-specific way"
The x86-64 memmove code has two ALTERNATIVE statements in a row, one to
handle FSRM ("Fast Short REP MOVSB"), and one to handle ERMS ("Enhanced
REP MOVSB"). If either of these features is present, the goal is to jump
directly to a REP MOVSB; otherwise, some setup code that handles short
lengths is executed. The first comparison of a sequence of specific
small sizes is included in the first ALTERNATIVE, so it will be replaced
by NOPs if FSRM is set, and then (assuming ERMS is also set) execution
will fall through to the JMP to a REP MOVSB in the next ALTERNATIVE.
The two ALTERNATIVE invocations can be combined into a single instance
of ALTERNATIVE_2 to simplify and slightly shorten the code. If either
FSRM or ERMS is set, the first instruction in the memmove_begin_forward
path will be replaced with a jump to the REP MOVSB.
This also prevents a problem when FSRM is set but ERMS is not; in this
case, the previous code would have replaced both ALTERNATIVEs with NOPs
and skipped the first check for sizes less than 0x20 bytes. This
combination of CPU features is arguably a firmware bug, but this patch
makes the function robust against this badness.
Fixes: f444a5ff95dc ("x86/cpufeatures: Add support for fast short REP; MOVSB")
Signed-off-by: Daniel Verkamp <dverkamp(a)chromium.org>
---
arch/x86/lib/memmove_64.S | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/lib/memmove_64.S b/arch/x86/lib/memmove_64.S
index 724bbf83eb5b..1fc36dbd3bdc 100644
--- a/arch/x86/lib/memmove_64.S
+++ b/arch/x86/lib/memmove_64.S
@@ -38,8 +38,10 @@ SYM_FUNC_START(__memmove)
/* FSRM implies ERMS => no length checks, do the copy directly */
.Lmemmove_begin_forward:
- ALTERNATIVE "cmp $0x20, %rdx; jb 1f", "", X86_FEATURE_FSRM
- ALTERNATIVE "", "jmp .Lmemmove_erms", X86_FEATURE_ERMS
+ ALTERNATIVE_2 \
+ "cmp $0x20, %rdx; jb 1f", \
+ "jmp .Lmemmove_erms", X86_FEATURE_FSRM, \
+ "jmp .Lmemmove_erms", X86_FEATURE_ERMS
/*
* movsq instruction have many startup latency
--
2.39.0.314.g84b9a713c41-goog
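The FSRM-without-ERMS combination described above is easy to probe from
userspace. The following is a hedged sketch using GCC's <cpuid.h>
helper; the bit positions (ERMS in CPUID.(EAX=7,ECX=0):EBX bit 9, FSRM
in that leaf's EDX bit 4) match the feature definitions the patch
refers to.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return 1; /* CPUID leaf 7 not available */

	int erms = !!(ebx & (1u << 9)); /* Enhanced REP MOVSB/STOSB */
	int fsrm = !!(edx & (1u << 4)); /* Fast Short REP MOVSB */

	printf("ERMS=%d FSRM=%d%s\n", erms, fsrm,
	       (fsrm && !erms) ? " (FSRM without ERMS)" : "");
	return 0;
}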
This reverts commit 7efc3b7261030da79001c00d92bc3392fd6c664c.
We have received openSUSE reports (Link 1) of the 6.1 kernel with
khugepaged stalling the CPU for long periods of time. Investigation of
tracepoint data shows that compaction gets stuck repeatedly isolating
migration pages via fast_find_migrateblock() and then failing to
migrate all of the isolated pages. Commit 7efc3b726103 ("mm/compaction:
fix set skip in fast_find_migrateblock") was suspected, as it was
merged in 6.1 and can in theory remove a termination condition for
fast_find_migrateblock() under certain conditions: it removes a place
that always marks a scanned pageblock so that it is not re-scanned.
There are other places that set this mark, but those can be skipped
under certain conditions, which seems to match the tracepoint data.
Testing of the revert also appears to have resolved the issue, so
revert the commit until a more robust solution for the original problem
is developed.
It is also likely that this will fix the qemu stalls with the 6.1
kernel reported in Link 2, but that is not yet confirmed.
Link: https://bugzilla.suse.com/show_bug.cgi?id=1206848
Link: https://lore.kernel.org/kvm/b8017e09-f336-3035-8344-c549086c2340@kernel.org/
Fixes: 7efc3b726103 ("mm/compaction: fix set skip in fast_find_migrateblock")
Cc: <stable(a)vger.kernel.org>
---
mm/compaction.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/compaction.c b/mm/compaction.c
index ca1603524bbe..8238e83385a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1839,6 +1839,7 @@ static unsigned long fast_find_migrateblock(struct compact_control *cc)
pfn = cc->zone->zone_start_pfn;
cc->fast_search_fail = 0;
found_block = true;
+ set_pageblock_skip(freepage);
break;
}
}
--
2.39.0
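The termination argument behind the revert is easier to see in
isolation. The sketch below is purely conceptual; find_candidate_block()
and the skip flag are hypothetical stand-ins rather than the kernel's
compaction API. The point is only that a scanner which can hand back a
previously seen block must mark scanned blocks somewhere, or it may
never terminate.

/* Conceptual sketch only: find_candidate_block() and the skip flag
 * are hypothetical stand-ins, not the kernel's compaction API. */
#include <stdbool.h>
#include <stddef.h>

struct block {
	bool skip;
};

/* A scanner that restarts from heuristic positions (as
 * fast_find_migrateblock() does) may return a block it has already
 * handed out, unless that block was marked. */
static struct block *find_candidate_block(struct block *pool, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!pool[i].skip)
			return &pool[i];
	return NULL;
}

static void scan(struct block *pool, size_t n)
{
	struct block *blk;

	while ((blk = find_candidate_block(pool, n)) != NULL) {
		/* Setting the mark is the termination condition: a path
		 * that skips it (as the reverted commit introduced) can
		 * receive the same block over and over and stall. */
		blk->skip = true;
		/* ... isolate and try to migrate pages from blk ... */
	}
}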
Before these patches, the Userspace Path Manager would allow the
creation of subflows with the wrong family: it took the family of the
MPTCP socket instead of the one from the provided addresses, resulting
in subflows that likely had the wrong source and/or destination IPs. It
would also allow the creation of subflows between different families,
or subflows that did not respect v4/v6-only socket attributes.
Patch 1 lets the userspace PM select the proper family to avoid creating
subflows with the wrong source and/or destination addresses because the
family is not the expected one.
Patch 2 makes sure the userspace PM doesn't allow the userspace to
create subflows for a family that is not allowed.
Patch 3 validates scenarios with a mix of v4 and v6 subflows for the
same MPTCP connection.
These patches fix issues introduced in v5.19, when the userspace path
manager was introduced.
To: "David S. Miller" <davem(a)davemloft.net>
To: Eric Dumazet <edumazet(a)google.com>
To: Jakub Kicinski <kuba(a)kernel.org>
To: Kishen Maloor <kishen.maloor(a)intel.com>
To: Florian Westphal <fw(a)strlen.de>
To: Shuah Khan <shuah(a)kernel.org>
Cc: netdev(a)vger.kernel.org
Cc: mptcp(a)lists.linux.dev
Cc: linux-kernel(a)vger.kernel.org
Cc: linux-kselftest(a)vger.kernel.org
Cc: Paolo Abeni <pabeni(a)redhat.com>
Cc: Mat Martineau <mathew.j.martineau(a)linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts(a)tessares.net>
---
Matthieu Baerts (2):
mptcp: netlink: respect v4/v6-only sockets
selftests: mptcp: userspace: validate v4-v6 subflows mix
Paolo Abeni (1):
mptcp: explicitly specify sock family at subflow creation time
net/mptcp/pm.c | 25 ++++++++++++
net/mptcp/pm_userspace.c | 7 ++++
net/mptcp/protocol.c | 2 +-
net/mptcp/protocol.h | 6 ++-
net/mptcp/subflow.c | 9 +++--
tools/testing/selftests/net/mptcp/userspace_pm.sh | 47 +++++++++++++++++++++++
6 files changed, 90 insertions(+), 6 deletions(-)
---
base-commit: be53771c87f4e322a9835d3faa9cd73a4ecdec5b
change-id: 20230112-upstream-net-20230112-netlink-v4-v6-b6b958039ee0
Best regards,
--
Matthieu Baerts <matthieu.baerts(a)tessares.net>
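As a rough illustration of the check added by patch 2, the sketch below
expresses the family rule in plain userspace C: a v4 socket only
accepts v4 subflow addresses, and a v6 socket accepts v4 addresses only
as v4-mapped and only when it is not v6-only. This is a hedged sketch
of the rule, not the kernel code from the series.

#include <sys/socket.h>
#include <netinet/in.h>
#include <stdbool.h>

/* Hedged sketch of the rule, in userspace style; this is not the
 * kernel code from the patches. */
static bool subflow_family_allowed(int sk_family, bool v6only,
				   const struct sockaddr *addr)
{
	if (sk_family == AF_INET)
		return addr->sa_family == AF_INET;

	if (sk_family != AF_INET6)
		return false;

	if (addr->sa_family == AF_INET6) {
		const struct sockaddr_in6 *a6 =
			(const struct sockaddr_in6 *)addr;

		/* A v4-mapped address is really a v4 subflow; reject
		 * it on a v6-only socket. */
		if (IN6_IS_ADDR_V4MAPPED(&a6->sin6_addr))
			return !v6only;
		return true;
	}

	/* A plain v4 address on a v6 socket is only usable when the
	 * socket is not v6-only (it becomes a v4-mapped subflow). */
	return addr->sa_family == AF_INET && !v6only;
}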
Hello!
This is an experimental semi-automated report about issues detected by
Coverity from a scan of next-20230113 as part of the linux-next scan project:
https://scan.coverity.com/projects/linux-next-weekly-scan
You're getting this email because you were associated with the identified
lines of code (noted below) that were touched by commits:
Wed Jan 11 16:15:43 2023 -0800
ebc4c1bcc2a5 ("maple_tree: fix mas_empty_area_rev() lower bound validation")
Coverity reported the following:
*** CID 1530569: Error handling issues (CHECKED_RETURN)
lib/test_maple_tree.c:2598 in check_empty_area_window()
2592 MT_BUG_ON(mt, mas_empty_area(&mas, 5, 100, 6) != -EBUSY);
2593
2594 mas_reset(&mas);
2595 MT_BUG_ON(mt, mas_empty_area(&mas, 0, 8, 10) != -EBUSY);
2596
2597 mas_reset(&mas);
vvv CID 1530569: Error handling issues (CHECKED_RETURN)
vvv Calling "mas_empty_area" without checking return value (as is done elsewhere 8 out of 10 times).
2598 mas_empty_area(&mas, 100, 165, 3);
2599
2600 mas_reset(&mas);
2601 MT_BUG_ON(mt, mas_empty_area(&mas, 100, 163, 6) != -EBUSY);
2602 rcu_read_unlock();
2603 }
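If the call at line 2598 is expected to succeed, one minimal fix that
mirrors the neighbouring assertions would be (a sketch, assuming
success is indeed the intended outcome):

	mas_reset(&mas);
	MT_BUG_ON(mt, mas_empty_area(&mas, 100, 165, 3) != 0);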
If this is a false positive, please let us know so we can mark it as
such, or teach the Coverity rules to be smarter. If not, please make
sure fixes get into linux-next. :) For patches fixing this, please
include these lines (but double-check the "Fixes" first):
Reported-by: coverity-bot <keescook+coverity-bot(a)chromium.org>
Addresses-Coverity-ID: 1530569 ("Error handling issues")
Fixes: ebc4c1bcc2a5 ("maple_tree: fix mas_empty_area_rev() lower bound validation")
Thanks for your attention!
--
Coverity-bot