Recently a performance problem was reported for a process invoking a non-trivial ASL program. The method call in this case ends up repetitively triggering a call path like:

    acpi_ex_store
    acpi_ex_store_object_to_node
    acpi_ex_write_data_to_field
    acpi_ex_insert_into_field
    acpi_ex_write_with_update_rule
    acpi_ex_field_datum_io
    acpi_ex_access_region
    acpi_ev_address_space_dispatch
    acpi_ex_system_memory_space_handler
    acpi_os_map_cleanup.part.14
    _synchronize_rcu_expedited.constprop.89
    schedule
The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X.
For example, performance improves from 16 minutes to 7 minutes for a firmware update procedure across 24 devices.
Perhaps the rcu usage was intended to allow for not taking a sleeping lock in the acpi_os_{read,write}_memory() path which ostensibly could be called from an APEI NMI error interrupt? Neither rcu_read_lock() nor ioremap() are interrupt safe, so add a WARN_ONCE() to validate that rcu was not serving as a mechanism to avoid direct calls to ioremap(). Even the original implementation had a spin_lock_irqsave(), but that is not NMI safe.
APEI itself already has some concept of avoiding ioremap() from interrupt context (see erst_exec_move_data()). If the new warning triggers, it means that APEI either needs more instrumentation like that to pre-emptively fail, or more infrastructure to arrange for pre-mapping the resources it needs in NMI context.
Cc: <stable@vger.kernel.org>
Fixes: 620242ae8c3d ("ACPI: Maintain a list of ACPI memory mapped I/O remappings")
Cc: Len Brown <lenb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: James Morse <james.morse@arm.com>
Cc: Erik Kaneda <erik.kaneda@intel.com>
Cc: Myron Stowe <myron.stowe@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v1 [1]:
- Actually cc: the most important list for ACPI changes (Rafael)
- Cleanup unnecessary variable initialization (Andy)
Link: https://lore.kernel.org/linux-nvdimm/158880834905.2183490.156163294694202340...
 drivers/acpi/osl.c | 117 +++++++++++++++++++++++++---------------------------
 1 file changed, 57 insertions(+), 60 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 762c5d50b8fe..a44b75aac5d0 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -214,13 +214,13 @@ acpi_physical_address __init acpi_os_get_root_pointer(void)
 	return pa;
 }
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
 static struct acpi_ioremap *
 acpi_map_lookup(acpi_physical_address phys, acpi_size size)
 {
 	struct acpi_ioremap *map;
 
-	list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+	lockdep_assert_held(&acpi_ioremap_lock);
+	list_for_each_entry(map, &acpi_ioremaps, list)
 		if (map->phys <= phys &&
 		    phys + size <= map->phys + map->size)
 			return map;
@@ -228,7 +228,6 @@ acpi_map_lookup(acpi_physical_address phys, acpi_size size)
 	return NULL;
 }
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
 static void __iomem *
 acpi_map_vaddr_lookup(acpi_physical_address phys, unsigned int size)
 {
@@ -263,7 +262,8 @@ acpi_map_lookup_virt(void __iomem *virt, acpi_size size)
 {
 	struct acpi_ioremap *map;
 
-	list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+	lockdep_assert_held(&acpi_ioremap_lock);
+	list_for_each_entry(map, &acpi_ioremaps, list)
 		if (map->virt <= virt &&
 		    virt + size <= map->virt + map->size)
 			return map;
@@ -360,7 +360,7 @@ void __iomem __ref
 	map->size = pg_sz;
 	map->refcount = 1;
 
-	list_add_tail_rcu(&map->list, &acpi_ioremaps);
+	list_add_tail(&map->list, &acpi_ioremaps);
 
  out:
 	mutex_unlock(&acpi_ioremap_lock);
@@ -374,20 +374,13 @@ void *__ref acpi_os_map_memory(acpi_physical_address phys, acpi_size size)
 }
 EXPORT_SYMBOL_GPL(acpi_os_map_memory);
 
-/* Must be called with mutex_lock(&acpi_ioremap_lock) */
-static unsigned long acpi_os_drop_map_ref(struct acpi_ioremap *map)
-{
-	unsigned long refcount = --map->refcount;
-
-	if (!refcount)
-		list_del_rcu(&map->list);
-	return refcount;
-}
-
-static void acpi_os_map_cleanup(struct acpi_ioremap *map)
+static void acpi_os_drop_map_ref(struct acpi_ioremap *map)
 {
-	synchronize_rcu_expedited();
+	lockdep_assert_held(&acpi_ioremap_lock);
+	if (--map->refcount > 0)
+		return;
 	acpi_unmap(map->phys, map->virt);
+	list_del(&map->list);
 	kfree(map);
 }
 
@@ -408,7 +401,6 @@ static void acpi_os_map_cleanup(struct acpi_ioremap *map)
 void __ref acpi_os_unmap_iomem(void __iomem *virt, acpi_size size)
 {
 	struct acpi_ioremap *map;
-	unsigned long refcount;
 
 	if (!acpi_permanent_mmap) {
 		__acpi_unmap_table(virt, size);
@@ -422,11 +414,8 @@ void __ref acpi_os_unmap_iomem(void __iomem *virt, acpi_size size)
 		WARN(true, PREFIX "%s: bad address %p\n", __func__, virt);
 		return;
 	}
-	refcount = acpi_os_drop_map_ref(map);
+	acpi_os_drop_map_ref(map);
 	mutex_unlock(&acpi_ioremap_lock);
-
-	if (!refcount)
-		acpi_os_map_cleanup(map);
 }
 EXPORT_SYMBOL_GPL(acpi_os_unmap_iomem);
 
@@ -461,7 +450,6 @@ void acpi_os_unmap_generic_address(struct acpi_generic_address *gas)
 {
 	u64 addr;
 	struct acpi_ioremap *map;
-	unsigned long refcount;
 
 	if (gas->space_id != ACPI_ADR_SPACE_SYSTEM_MEMORY)
 		return;
@@ -477,11 +465,8 @@ void acpi_os_unmap_generic_address(struct acpi_generic_address *gas)
 		mutex_unlock(&acpi_ioremap_lock);
 		return;
 	}
-	refcount = acpi_os_drop_map_ref(map);
+	acpi_os_drop_map_ref(map);
 	mutex_unlock(&acpi_ioremap_lock);
-
-	if (!refcount)
-		acpi_os_map_cleanup(map);
 }
 EXPORT_SYMBOL(acpi_os_unmap_generic_address);
 
@@ -700,55 +685,71 @@ int acpi_os_read_iomem(void __iomem *virt_addr, u64 *value, u32 width)
 	return 0;
 }
 
+static void __iomem *acpi_os_rw_map(acpi_physical_address phys_addr,
+				    unsigned int size, bool *did_fallback)
+{
+	void __iomem *virt_addr;
+
+	if (WARN_ONCE(in_interrupt(), "ioremap in interrupt context\n"))
+		return NULL;
+
+	/* Try to use a cached mapping and fallback otherwise */
+	*did_fallback = false;
+	mutex_lock(&acpi_ioremap_lock);
+	virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
+	if (virt_addr)
+		return virt_addr;
+	mutex_unlock(&acpi_ioremap_lock);
+
+	virt_addr = acpi_os_ioremap(phys_addr, size);
+	*did_fallback = true;
+
+	return virt_addr;
+}
+
+static void acpi_os_rw_unmap(void __iomem *virt_addr, bool did_fallback)
+{
+	if (did_fallback) {
+		/* in the fallback case no lock is held */
+		iounmap(virt_addr);
+		return;
+	}
+
+	mutex_unlock(&acpi_ioremap_lock);
+}
+
 acpi_status
 acpi_os_read_memory(acpi_physical_address phys_addr, u64 *value, u32 width)
 {
-	void __iomem *virt_addr;
 	unsigned int size = width / 8;
-	bool unmap = false;
+	bool did_fallback = false;
+	void __iomem *virt_addr;
 	u64 dummy;
 	int error;
 
-	rcu_read_lock();
-	virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
-	if (!virt_addr) {
-		rcu_read_unlock();
-		virt_addr = acpi_os_ioremap(phys_addr, size);
-		if (!virt_addr)
-			return AE_BAD_ADDRESS;
-		unmap = true;
-	}
-
+	virt_addr = acpi_os_rw_map(phys_addr, size, &did_fallback);
+	if (!virt_addr)
+		return AE_BAD_ADDRESS;
 	if (!value)
 		value = &dummy;
 
 	error = acpi_os_read_iomem(virt_addr, value, width);
 	BUG_ON(error);
 
-	if (unmap)
-		iounmap(virt_addr);
-	else
-		rcu_read_unlock();
-
+	acpi_os_rw_unmap(virt_addr, did_fallback);
 	return AE_OK;
 }
 
 acpi_status
 acpi_os_write_memory(acpi_physical_address phys_addr, u64 value, u32 width)
 {
-	void __iomem *virt_addr;
 	unsigned int size = width / 8;
-	bool unmap = false;
+	bool did_fallback = false;
+	void __iomem *virt_addr;
 
-	rcu_read_lock();
-	virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
-	if (!virt_addr) {
-		rcu_read_unlock();
-		virt_addr = acpi_os_ioremap(phys_addr, size);
-		if (!virt_addr)
-			return AE_BAD_ADDRESS;
-		unmap = true;
-	}
+	virt_addr = acpi_os_rw_map(phys_addr, size, &did_fallback);
+	if (!virt_addr)
+		return AE_BAD_ADDRESS;
 
 	switch (width) {
 	case 8:
@@ -767,11 +768,7 @@ acpi_os_write_memory(acpi_physical_address phys_addr, u64 value, u32 width)
 		BUG();
 	}
 
-	if (unmap)
-		iounmap(virt_addr);
-	else
-		rcu_read_unlock();
-
+	acpi_os_rw_unmap(virt_addr, did_fallback);
	return AE_OK;
 }
Hi
[This is an automated email]
This commit has been processed because it contains a "Fixes:" tag fixing commit: 620242ae8c3d ("ACPI: Maintain a list of ACPI memory mapped I/O remappings").
The bot has tested the following trees: v5.6.11, v5.4.39, v4.19.121, v4.14.179, v4.9.222, v4.4.222.
v5.6.11: Build OK!
v5.4.39: Build OK!
v4.19.121: Failed to apply! Possible dependencies:
    bee6f87166e9 ("acpi: Use built-in RCU list checking for acpi_ioremaps list")

v4.14.179: Failed to apply! Possible dependencies:
    bee6f87166e9 ("acpi: Use built-in RCU list checking for acpi_ioremaps list")
    eeb2d80d502a ("ACPI / LPIT: Add Low Power Idle Table (LPIT) support")

v4.9.222: Failed to apply! Possible dependencies:
    bee6f87166e9 ("acpi: Use built-in RCU list checking for acpi_ioremaps list")
    d44fa3d46079 ("ACPI: Add support for ResourceSource/IRQ domain mapping")
    eeb2d80d502a ("ACPI / LPIT: Add Low Power Idle Table (LPIT) support")

v4.4.222: Failed to apply! Possible dependencies:
    05d961320ba6 ("of: earlycon: Fixup earlycon console name and index")
    088da2a17619 ("of: earlycon: Initialize port fields from DT properties")
    2eaa790989e0 ("earlycon: Use common framework for earlycon declarations")
    4d118c9a8665 ("of: earlycon: Add options string handling")
    5ae74f2cc2f1 ("ACPI / tables: Move table override mechanisms to tables.c")
    836d0830188a ("ACPI / debugger: Add module support for ACPI debugger")
    8cfb0cdf07e2 ("ACPI / debugger: Add IO interface to access debugger functionalities")
    a543132ee04d ("ACPI / OSL: Clean up initrd table override code")
    ad1696f6f09d ("ACPI: parse SPCR and enable matching console")
    af06f8b7a102 ("ACPI / x86: Cleanup initrd related code")
    bee6f87166e9 ("acpi: Use built-in RCU list checking for acpi_ioremaps list")
    c85cc817e5b6 ("ACPI / OSL: Add support to install tables via initrd")
    d503187b6cc4 ("of/serial: move earlycon early_param handling to serial")
    da3d3f98d28b ("ACPI / tables: table upgrade: refactor function definitions")
    eeb2d80d502a ("ACPI / LPIT: Add Low Power Idle Table (LPIT) support")
    f8d31489629c ("ACPICA: Debugger: Convert some mechanisms to OSPM specific")
NOTE: The patch will not be queued to stable trees until it is upstream.
How should we proceed with this patch?
Greeting,
FYI, we noticed the following commit (built with gcc-7):
commit: 5a91d41f89e8874053e12766fa8eb5eaa997d277 ("[PATCH v2] ACPI: Drop rcu usage for MMIO mappings")
url: https://github.com/0day-ci/linux/commits/Dan-Williams/ACPI-Drop-rcu-usage-fo...
base: https://git.kernel.org/cgit/linux/kernel/git/rafael/linux-pm.git linux-next
in testcase: v4l2
with following parameters:

	test: device
	ucode: 0x43
on test machine: 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 256G memory
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <rong.a.chen@intel.com>
[   15.593161] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
[   15.594078] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
[   15.594078] 2 locks held by swapper/0/1:
[   15.594078]  #0: ffff88a08055b188 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x1d/0x60
[   15.594078]  #1: ffffffff82a1a658 (ghes_notify_lock_irq){....}-{2:2}, at: ghes_probe+0x1c7/0x470
[   15.594078] irq event stamp: 11350142
[   15.594078] hardirqs last  enabled at (11350141): [<ffffffff8137d9bf>] kfree+0x18f/0x2f0
[   15.594078] hardirqs last disabled at (11350142): [<ffffffff81cc0c00>] _raw_spin_lock_irqsave+0x20/0x60
[   15.594078] softirqs last  enabled at (11350110): [<ffffffff82000353>] __do_softirq+0x353/0x466
[   15.594078] softirqs last disabled at (11350103): [<ffffffff8113c277>] irq_exit+0xe7/0xf0
[   15.594078] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4-00022-g5a91d41f89e88 #1
[   15.594078] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[   15.594078] Call Trace:
[   15.594078]  dump_stack+0x8f/0xcb
[   15.594078]  ___might_sleep+0x175/0x260
[   15.594078]  __mutex_lock+0x55/0x9f0
[   15.594078]  ? cpumask_next+0x17/0x20
[   15.594078]  ? validate_chain+0xdec/0x1240
[   15.594078]  ? rdinit_setup+0x2b/0x2b
[   15.594078]  ? acpi_os_rw_map+0x34/0xb0
[   15.594078]  acpi_os_rw_map+0x34/0xb0
[   15.594078]  acpi_os_read_memory+0x34/0xc0
[   15.594078]  ? lock_acquire+0xac/0x390
[   15.594078]  apei_read+0x97/0xb0
[   15.594078]  __ghes_peek_estatus+0x27/0xc0
[   15.594078]  ghes_proc+0x37/0x120
[   15.594078]  ghes_probe+0x1d2/0x470
[   15.594078]  platform_drv_probe+0x37/0x90
[   15.594078]  really_probe+0xef/0x430
[   15.594078]  driver_probe_device+0x110/0x120
[   15.594078]  device_driver_attach+0x4f/0x60
[   15.594078]  __driver_attach+0x9c/0x140
[   15.594078]  ? device_driver_attach+0x60/0x60
[   15.594078]  bus_for_each_dev+0x79/0xc0
[   15.594078]  bus_add_driver+0x147/0x220
[   15.594078]  ? bert_init+0x229/0x229
[   15.594078]  driver_register+0x5b/0xf0
[   15.594078]  ? bert_init+0x229/0x229
[   15.594078]  ghes_init+0x83/0xde
[   15.594078]  do_one_initcall+0x5d/0x2f0
[   15.594078]  ? rdinit_setup+0x2b/0x2b
[   15.594078]  ? rcu_read_lock_sched_held+0x52/0x80
[   15.594078]  kernel_init_freeable+0x260/0x2da
[   15.594078]  ? rest_init+0x250/0x250
[   15.594078]  kernel_init+0xa/0x110
[   15.594078]  ret_from_fork+0x3a/0x50
[   15.594078]
[   15.594078] =============================
[   15.594078] [ BUG: Invalid wait context ]
[   15.594078] 5.7.0-rc4-00022-g5a91d41f89e88 #1 Tainted: G W
[   15.594078] -----------------------------
[   15.594078] swapper/0/1 is trying to lock:
[   15.594078] ffffffff82a0b5c8 (acpi_ioremap_lock){+.+.}-{3:3}, at: acpi_os_rw_map+0x34/0xb0
[   15.594078] other info that might help us debug this:
[   15.594078] context-{4:4}
[   15.594078] 2 locks held by swapper/0/1:
[   15.594078]  #0: ffff88a08055b188 (&dev->mutex){....}-{3:3}, at: device_driver_attach+0x1d/0x60
[   15.594078]  #1: ffffffff82a1a658 (ghes_notify_lock_irq){....}-{2:2}, at: ghes_probe+0x1c7/0x470
[   15.594078] stack backtrace:
[   15.594078] CPU: 1 PID: 1 Comm: swapper/0 Tainted: G W 5.7.0-rc4-00022-g5a91d41f89e88 #1
[   15.594078] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
[   15.594078] Call Trace:
[   15.594078]  dump_stack+0x8f/0xcb
[   15.594078]  __lock_acquire+0x61e/0xbc0
[   15.594078]  lock_acquire+0xac/0x390
[   15.594078]  ? acpi_os_rw_map+0x34/0xb0
[   15.594078]  ? acpi_os_rw_map+0x34/0xb0
[   15.594078]  ? acpi_os_rw_map+0x34/0xb0
[   15.594078]  __mutex_lock+0xa1/0x9f0
[   15.594078]  ? acpi_os_rw_map+0x34/0xb0
[   15.594078]  ? cpumask_next+0x17/0x20
[   15.594078]  ? rdinit_setup+0x2b/0x2b
[   15.594078]  ? acpi_os_rw_map+0x34/0xb0
[   15.594078]  acpi_os_rw_map+0x34/0xb0
[   15.594078]  acpi_os_read_memory+0x34/0xc0
[   15.594078]  ? lock_acquire+0xac/0x390
[   15.594078]  apei_read+0x97/0xb0
[   15.594078]  __ghes_peek_estatus+0x27/0xc0
[   15.594078]  ghes_proc+0x37/0x120
[   15.594078]  ghes_probe+0x1d2/0x470
[   15.594078]  platform_drv_probe+0x37/0x90
[   15.594078]  really_probe+0xef/0x430
[   15.594078]  driver_probe_device+0x110/0x120
[   15.594078]  device_driver_attach+0x4f/0x60
[   15.594078]  __driver_attach+0x9c/0x140
[   15.594078]  ? device_driver_attach+0x60/0x60
[   15.594078]  bus_for_each_dev+0x79/0xc0
[   15.594078]  bus_add_driver+0x147/0x220
[   15.594078]  ? bert_init+0x229/0x229
[   15.594078]  driver_register+0x5b/0xf0
[   15.594078]  ? bert_init+0x229/0x229
[   15.594078]  ghes_init+0x83/0xde
[   15.594078]  do_one_initcall+0x5d/0x2f0
[   15.594078]  ? rdinit_setup+0x2b/0x2b
[   15.594078]  ? rcu_read_lock_sched_held+0x52/0x80
[   15.594078]  kernel_init_freeable+0x260/0x2da
[   15.594078]  ? rest_init+0x250/0x250
[   15.594078]  kernel_init+0xa/0x110
[   15.594078]  ret_from_fork+0x3a/0x50
[   16.109557] GHES: APEI firmware first mode is enabled by APEI bit and WHEA _OSC.
[   16.118199] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[   16.125308] 00:02: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[   16.133596] 00:03: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[   16.142660] Non-volatile memory driver v1.3
[   16.147391] Linux agpgart interface v0.103
[   16.158812] lkdtm: No crash points registered, enable through debugfs
[   16.166314] rdac: device handler registered
[   16.171055] hp_sw: device handler registered
[   16.175838] emc: device handler registered
[   16.180592] alua: device handler registered
[   16.185679] MACsec IEEE 802.1AE
[   16.189408] libphy: Fixed MDIO Bus: probed
[   16.194253] e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
[   16.202138] e1000: Copyright (c) 1999-2006 Intel Corporation.
[   16.208641] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
[   16.215173] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
[   16.221901] igb: Intel(R) Gigabit Ethernet Network Driver - version 5.6.0-k
[   16.229691] igb: Copyright (c) 2007-2014 Intel Corporation.
[   16.235985] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 5.1.0-k
[   16.244549] ixgbe: Copyright (c) 1999-2016 Intel Corporation.
[   16.251177] IOAPIC[9]: Set IRTE entry (P:1 FPD:0 Dst_Mode:1 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:EF Dest:00000001 SID:002C SQ:0 SVT:1)
[   16.266251] IOAPIC[1]: Set routing entry (9-13 -> 0xef -> IRQ 38 Mode:1 Active:1 Dest:1)
[   16.548010] ixgbe 0000:03:00.0: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
[   16.643801] ixgbe 0000:03:00.0: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[   16.677322] ixgbe 0000:03:00.0: MAC: 3, PHY: 0, PBA No: 000000-000
[   16.684240] ixgbe 0000:03:00.0: 00:1e:67:f7:44:b3
[   16.838449] ixgbe 0000:03:00.0: Intel(R) 10 Gigabit Network Connection
[   16.845921] libphy: ixgbe-mdio: probed
[   16.850249] IOAPIC[9]: Set IRTE entry (P:1 FPD:0 Dst_Mode:1 Redir_hint:1 Trig_Mode:0 Dlvry_Mode:0 Avail:0 Vector:EF Dest:00000001 SID:002C SQ:0 SVT:1)
[   16.865321] IOAPIC[1]: Set routing entry (9-10 -> 0xef -> IRQ 105 Mode:1 Active:1 Dest:1)
[   17.147743] ixgbe 0000:03:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
[   17.243466] ixgbe 0000:03:00.1: 32.000 Gb/s available PCIe bandwidth (5.0 GT/s PCIe x8 link)
[   17.276992] ixgbe 0000:03:00.1: MAC: 3, PHY: 0, PBA No: 000000-000
[   17.283910] ixgbe 0000:03:00.1: 00:1e:67:f7:44:b4
[   17.437334] ixgbe 0000:03:00.1: Intel(R) 10 Gigabit Network Connection
[   17.444791] libphy: ixgbe-mdio: probed
[   17.449025] i40e: Intel(R) Ethernet Connection XL710 Network Driver - version 2.8.20-k
[   17.457890] i40e: Copyright (c) 2013 - 2019 Intel Corporation.
[   17.465025] usbcore: registered new interface driver catc
[   17.471076] usbcore: registered new interface driver kaweth
[   17.477311] pegasus: v0.9.3 (2013/04/25), Pegasus/Pegasus II USB Ethernet driver
[   17.485597] usbcore: registered new interface driver pegasus
[   17.491949] usbcore: registered new interface driver rtl8150
[   17.498295] usbcore: registered new interface driver asix
[   17.504340] usbcore: registered new interface driver cdc_ether
[   17.510870] usbcore: registered new interface driver cdc_eem
[   17.517207] usbcore: registered new interface driver dm9601
[   17.523455] usbcore: registered new interface driver smsc75xx
[   17.529891] usbcore: registered new interface driver smsc95xx
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml  # job file is attached in this email
bin/lkp run job.yaml
Thanks, Rong Chen
On Fri, May 8, 2020 at 1:55 AM Dan Williams dan.j.williams@intel.com wrote:
Recently a performance problem was reported for a process invoking a non-trivial ASL program. The method call in this case ends up repetitively triggering a call path like:

    acpi_ex_store
    acpi_ex_store_object_to_node
    acpi_ex_write_data_to_field
    acpi_ex_insert_into_field
    acpi_ex_write_with_update_rule
    acpi_ex_field_datum_io
    acpi_ex_access_region
    acpi_ev_address_space_dispatch
    acpi_ex_system_memory_space_handler
    acpi_os_map_cleanup.part.14
    _synchronize_rcu_expedited.constprop.89
    schedule
The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X.
For example, performance improves from 16 minutes to 7 minutes for a firmware update procedure across 24 devices.
Perhaps the rcu usage was intended to allow for not taking a sleeping lock in the acpi_os_{read,write}_memory() path which ostensibly could be called from an APEI NMI error interrupt?
Not really.
acpi_os_{read|write}_memory() end up being called from non-NMI interrupt context via acpi_hw_{read|write}(), respectively, and quite obviously ioremap() cannot be run from there, but in those cases the mappings in question are there in the list already in all cases and so the ioremap() isn't used then.
RCU is there to protect these users from walking the list while it is being updated.
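Schematically, the current arrangement is this (a simplified sketch of the existing osl.c logic, not the literal code):

	/* reader: acpi_os_read_memory() and friends, possibly in IRQ context */
	rcu_read_lock();
	virt_addr = acpi_map_vaddr_lookup(phys_addr, size);	/* RCU list walk */
	if (virt_addr) {
		/* readl()/writel() through the cached mapping */
	}
	rcu_read_unlock();

	/* updater: unmap path, process context */
	mutex_lock(&acpi_ioremap_lock);
	list_del_rcu(&map->list);
	mutex_unlock(&acpi_ioremap_lock);
	synchronize_rcu_expedited();	/* wait for readers to let go of the map */
	acpi_unmap(map->phys, map->virt);
	kfree(map);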
Neither rcu_read_lock() nor ioremap() are interrupt safe, so add a WARN_ONCE() to validate that rcu was not serving as a mechanism to avoid direct calls to ioremap().
But it would produce false-positives if the IRQ context was not NMI, wouldn't it?
Even the original implementation had a spin_lock_irqsave(), but that is not NMI safe.
Which is not a problem (see above).
APEI itself already has some concept of avoiding ioremap() from interrupt context (see erst_exec_move_data()). If the new warning triggers, it means that APEI either needs more instrumentation like that to pre-emptively fail, or more infrastructure to arrange for pre-mapping the resources it needs in NMI context.
Well, I'm not sure about that.
Thanks!
On Fri, Jun 5, 2020 at 6:32 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Fri, May 8, 2020 at 1:55 AM Dan Williams dan.j.williams@intel.com wrote:
Recently a performance problem was reported for a process invoking a non-trivial ASL program. The method call in this case ends up repetitively triggering a call path like:

    acpi_ex_store
    acpi_ex_store_object_to_node
    acpi_ex_write_data_to_field
    acpi_ex_insert_into_field
    acpi_ex_write_with_update_rule
    acpi_ex_field_datum_io
    acpi_ex_access_region
    acpi_ev_address_space_dispatch
    acpi_ex_system_memory_space_handler
    acpi_os_map_cleanup.part.14
    _synchronize_rcu_expedited.constprop.89
    schedule
The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X.
For example, performance improves from 16 minutes to 7 minutes for a firmware update procedure across 24 devices.
Perhaps the rcu usage was intended to allow for not taking a sleeping lock in the acpi_os_{read,write}_memory() path which ostensibly could be called from an APEI NMI error interrupt?
Not really.
acpi_os_{read|write}_memory() end up being called from non-NMI interrupt context via acpi_hw_{read|write}(), respectively, and quite obviously ioremap() cannot be run from there, but in those cases the mappings in question are there in the list already in all cases and so the ioremap() isn't used then.
RCU is there to protect these users from walking the list while it is being updated.
Neither rcu_read_lock() nor ioremap() are interrupt safe, so add a WARN_ONCE() to validate that rcu was not serving as a mechanism to avoid direct calls to ioremap().
But it would produce false-positives if the IRQ context was not NMI, wouldn't it?
Even the original implementation had a spin_lock_irqsave(), but that is not NMI safe.
Which is not a problem (see above).
APEI itself already has some concept of avoiding ioremap() from interrupt context (see erst_exec_move_data()). If the new warning triggers, it means that APEI either needs more instrumentation like that to pre-emptively fail, or more infrastructure to arrange for pre-mapping the resources it needs in NMI context.
Well, I'm not sure about that.
Right, this patch set is about 2-3 generations behind the architecture of the fix we are discussing internally, you might mention that.
The fix we are looking at now is to pre-map operation regions in a similar manner as the way APEI resources are pre-mapped. The pre-mapping would arrange for synchronize_rcu_expedited() to be elided on each dynamic mapping attempt. The other piece is to arrange for operation-regions to be mapped at their full size at once rather than a page at a time.
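To make that shape concrete, it is something like the following (the helper names here are made up for illustration; the real series hooks into the ACPICA region setup path):

	/* called once when the operation region becomes active */
	static void __iomem *acpi_opregion_premap(acpi_physical_address base,
						  acpi_size length)
	{
		/* map the whole region up front instead of page-at-a-time */
		return acpi_os_map_iomem(base, length);
	}

	/* called once when the region is deactivated */
	static void acpi_opregion_unmap(void __iomem *virt, acpi_size length)
	{
		acpi_os_unmap_iomem(virt, length);
	}

With the mapping held for the lifetime of the region, the per-access map/unmap, and with it the synchronize_rcu_expedited(), drops out of the hot path.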
On Fri, Jun 5, 2020 at 6:18 PM Dan Williams dan.j.williams@intel.com wrote:
On Fri, Jun 5, 2020 at 6:32 AM Rafael J. Wysocki rafael@kernel.org wrote:
On Fri, May 8, 2020 at 1:55 AM Dan Williams dan.j.williams@intel.com wrote:
Recently a performance problem was reported for a process invoking a non-trivial ASL program. The method call in this case ends up repetitively triggering a call path like:

    acpi_ex_store
    acpi_ex_store_object_to_node
    acpi_ex_write_data_to_field
    acpi_ex_insert_into_field
    acpi_ex_write_with_update_rule
    acpi_ex_field_datum_io
    acpi_ex_access_region
    acpi_ev_address_space_dispatch
    acpi_ex_system_memory_space_handler
    acpi_os_map_cleanup.part.14
    _synchronize_rcu_expedited.constprop.89
    schedule
The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X.
For example, performance improves from 16 minutes to 7 minutes for a firmware update procedure across 24 devices.
Perhaps the rcu usage was intended to allow for not taking a sleeping lock in the acpi_os_{read,write}_memory() path which ostensibly could be called from an APEI NMI error interrupt?
Not really.
acpi_os_{read|write}_memory() end up being called from non-NMI interrupt context via acpi_hw_{read|write}(), respectively, and quite obviously ioremap() cannot be run from there, but in those cases the mappings in question are there in the list already in all cases and so the ioremap() isn't used then.
RCU is there to protect these users from walking the list while it is being updated.
Neither rcu_read_lock() nor ioremap() are interrupt safe, so add a WARN_ONCE() to validate that rcu was not serving as a mechanism to avoid direct calls to ioremap().
But it would produce false-positives if the IRQ context was not NMI, wouldn't it?
Even the original implementation had a spin_lock_irqsave(), but that is not NMI safe.
Which is not a problem (see above).
APEI itself already has some concept of avoiding ioremap() from interrupt context (see erst_exec_move_data()). If the new warning triggers, it means that APEI either needs more instrumentation like that to pre-emptively fail, or more infrastructure to arrange for pre-mapping the resources it needs in NMI context.
Well, I'm not sure about that.
Right, this patch set is about 2-3 generations behind the architecture of the fix we are discussing internally, you might mention that.
Yes, sorry.
The fix we are looking at now is to pre-map operation regions in a similar manner as the way APEI resources are pre-mapped. The pre-mapping would arrange for synchronize_rcu_expedited() to be elided on each dynamic mapping attempt. The other piece is to arrange for operation-regions to be mapped at their full size at once rather than a page at a time.
However, if the RCU usage in ACPI OSL can be replaced with an rwlock, some of the ACPICA changes above may not be necessary anymore (even though some of them may still be worth making).
On Fri, Jun 5, 2020 at 9:22 AM Rafael J. Wysocki rafael@kernel.org wrote: [..]
The fix we are looking at now is to pre-map operation regions in a similar manner as the way APEI resources are pre-mapped. The pre-mapping would arrange for synchronize_rcu_expedited() to be elided on each dynamic mapping attempt. The other piece is to arrange for operation-regions to be mapped at their full size at once rather than a page at a time.
However, if the RCU usage in ACPI OSL can be replaced with an rwlock, some of the ACPICA changes above may not be necessary anymore (even though some of them may still be worth making).
I don't think you can replace the RCU usage in ACPI OSL and still maintain NMI lookups in a dynamic list.
However, there are 3 solutions I see:
- Prevent acpi_os_map_cleanup() from triggering at high frequency by pre-mapping and never unmapping operation-regions resources (internal discussion in progress)
- Prevent walks of the 'acpi_ioremaps' list (acpi_map_lookup_virt()) from NMI context by re-writing the physical addresses in the APEI tables with pre-mapped virtual address, i.e. remove rcu_read_lock() and list_for_each_entry_rcu() from NMI context.
- Split operation-region resources into a separate mapping mechanism than APEI resources so that typical locking can be used for the sleepable resources and let the NMI accessible resources be managed separately.
That last one is one we have not discussed internally, but it occurred to me when you mentioned replacing RCU.
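Roughly, the split in that last option would look something like this (the names are hypothetical, just to show the shape):

	/* opregion mappings: sleepable paths only, ordinary mutex + list */
	static LIST_HEAD(acpi_opregion_maps);
	static DEFINE_MUTEX(acpi_opregion_maps_lock);

	static void __iomem *acpi_opregion_vaddr_lookup(acpi_physical_address phys,
							unsigned int size)
	{
		struct acpi_ioremap *map;

		lockdep_assert_held(&acpi_opregion_maps_lock);
		list_for_each_entry(map, &acpi_opregion_maps, list)
			if (map->phys <= phys &&
			    phys + size <= map->phys + map->size)
				return map->virt + (phys - map->phys);
		return NULL;
	}

	/*
	 * APEI/NMI-visible mappings stay on their own pre-mapped list that is
	 * never torn down at runtime, so no synchronize_rcu_expedited() is
	 * needed in any hot path.
	 */
	static LIST_HEAD(acpi_apei_maps);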
On Fri, Jun 5, 2020 at 6:39 PM Dan Williams dan.j.williams@intel.com wrote:
On Fri, Jun 5, 2020 at 9:22 AM Rafael J. Wysocki rafael@kernel.org wrote: [..]
The fix we are looking at now is to pre-map operation regions in a similar manner as the way APEI resources are pre-mapped. The pre-mapping would arrange for synchronize_rcu_expedited() to be elided on each dynamic mapping attempt. The other piece is to arrange for operation-regions to be mapped at their full size at once rather than a page at a time.
However, if the RCU usage in ACPI OSL can be replaced with an rwlock, some of the ACPICA changes above may not be necessary anymore (even though some of them may still be worth making).
I don't think you can replace the RCU usage in ACPI OSL and still maintain NMI lookups in a dynamic list.
I'm not sure what NMI lookups have to do with the issue at hand.
If acpi_os_{read|write}_memory() is used from NMI, that is a bug already in there which is unrelated to the performance problem with opregions.
However, there are 3 solutions I see:
- Prevent acpi_os_map_cleanup() from triggering at high frequency by
pre-mapping and never unmapping operation-regions resources (internal discussion in progress)
Yes, that can be done, if necessary.
- Prevent walks of the 'acpi_ioremaps' list (acpi_map_lookup_virt())
from NMI context by re-writing the physical addresses in the APEI tables with pre-mapped virtual address, i.e. remove rcu_read_lock() and list_for_each_entry_rcu() from NMI context.
That sounds a bit convoluted to me.
- Split operation-region resources into a separate mapping mechanism
than APEI resources so that typical locking can be used for the sleepable resources and let the NMI accessible resources be managed separately.
That last one is one we have not discussed internally, but it occurred to me when you mentioned replacing RCU.
So NMI cannot use acpi_os_{read|write}_memory() safely which you have pointed out for a few times.
But even if NMI resources are managed separately, the others will still not be sleepable (at least not all of them).
Cheers!
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
The ACPI OS layer uses RCU to protect the list of ACPI memory mappings from being walked while it is updated. Among other situations, that list can be walked in non-NMI interrupt context, so using a sleeping lock to protect it is not an option.
However, performance issues related to the RCU usage in there appear, as described by Dan Williams:
"Recently a performance problem was reported for a process invoking a non-trival ASL program. The method call in this case ends up repetitively triggering a call path like:
acpi_ex_store acpi_ex_store_object_to_node acpi_ex_write_data_to_field acpi_ex_insert_into_field acpi_ex_write_with_update_rule acpi_ex_field_datum_io acpi_ex_access_region acpi_ev_address_space_dispatch acpi_ex_system_memory_space_handler acpi_os_map_cleanup.part.14 _synchronize_rcu_expedited.constprop.89 schedule
The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X."
In order to avoid these issues, replace the RCU in the ACPI OS layer by an rwlock.
That rwlock should not be frequently contended, so the performance impact of it is not expected to be significant.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
Hi Dan,
This is a possible fix for the ACPI OSL RCU-related performance issues, but can you please arrange for the testing of it on the affected systems?
Cheers!
---
 drivers/acpi/osl.c | 50 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 34 insertions(+), 16 deletions(-)
Index: linux-pm/drivers/acpi/osl.c
===================================================================
--- linux-pm.orig/drivers/acpi/osl.c
+++ linux-pm/drivers/acpi/osl.c
@@ -81,8 +81,8 @@ struct acpi_ioremap {
 };
 
 static LIST_HEAD(acpi_ioremaps);
+static DEFINE_RWLOCK(acpi_ioremaps_list_lock);
 static DEFINE_MUTEX(acpi_ioremap_lock);
-#define acpi_ioremap_lock_held() lock_is_held(&acpi_ioremap_lock.dep_map)
 
 static void __init acpi_request_region (struct acpi_generic_address *gas,
 	unsigned int length, char *desc)
@@ -214,13 +214,13 @@ acpi_physical_address __init acpi_os_get
 	return pa;
 }
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
+/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */
 static struct acpi_ioremap *
 acpi_map_lookup(acpi_physical_address phys, acpi_size size)
 {
 	struct acpi_ioremap *map;
 
-	list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+	list_for_each_entry(map, &acpi_ioremaps, list)
 		if (map->phys <= phys &&
 		    phys + size <= map->phys + map->size)
 			return map;
@@ -228,7 +228,7 @@ acpi_map_lookup(acpi_physical_address ph
 	return NULL;
 }
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
+/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */
 static void __iomem *
 acpi_map_vaddr_lookup(acpi_physical_address phys, unsigned int size)
 {
@@ -257,13 +257,13 @@ void __iomem *acpi_os_get_iomem(acpi_phy
 }
 EXPORT_SYMBOL_GPL(acpi_os_get_iomem);
 
-/* Must be called with 'acpi_ioremap_lock' or RCU read lock held. */
+/* Must be called with 'acpi_ioremap_lock' or 'acpi_ioremaps_list_lock' held. */
 static struct acpi_ioremap *
 acpi_map_lookup_virt(void __iomem *virt, acpi_size size)
 {
 	struct acpi_ioremap *map;
 
-	list_for_each_entry_rcu(map, &acpi_ioremaps, list, acpi_ioremap_lock_held())
+	list_for_each_entry(map, &acpi_ioremaps, list)
 		if (map->virt <= virt &&
 		    virt + size <= map->virt + map->size)
 			return map;
@@ -360,7 +360,11 @@ void __iomem __ref
 	map->size = pg_sz;
 	map->refcount = 1;
 
-	list_add_tail_rcu(&map->list, &acpi_ioremaps);
+	write_lock_irq(&acpi_ioremaps_list_lock);
+
+	list_add_tail(&map->list, &acpi_ioremaps);
+
+	write_unlock_irq(&acpi_ioremaps_list_lock);
 
  out:
 	mutex_unlock(&acpi_ioremap_lock);
@@ -379,14 +383,18 @@ static unsigned long acpi_os_drop_map_re
 {
 	unsigned long refcount = --map->refcount;
 
-	if (!refcount)
-		list_del_rcu(&map->list);
+	if (!refcount) {
+		write_lock_irq(&acpi_ioremaps_list_lock);
+
+		list_del(&map->list);
+
+		write_unlock_irq(&acpi_ioremaps_list_lock);
+	}
 	return refcount;
 }
 
 static void acpi_os_map_cleanup(struct acpi_ioremap *map)
 {
-	synchronize_rcu_expedited();
 	acpi_unmap(map->phys, map->virt);
 	kfree(map);
 }
@@ -704,18 +712,23 @@ acpi_status
 acpi_os_read_memory(acpi_physical_address phys_addr, u64 *value, u32 width)
 {
 	void __iomem *virt_addr;
+	unsigned long flags;
 	unsigned int size = width / 8;
 	bool unmap = false;
 	u64 dummy;
 	int error;
 
-	rcu_read_lock();
+	read_lock_irqsave(&acpi_ioremaps_list_lock, flags);
+
 	virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
 	if (!virt_addr) {
-		rcu_read_unlock();
+
+		read_unlock_irqrestore(&acpi_ioremaps_list_lock, flags);
+
 		virt_addr = acpi_os_ioremap(phys_addr, size);
 		if (!virt_addr)
 			return AE_BAD_ADDRESS;
+
 		unmap = true;
 	}
 
@@ -728,7 +741,7 @@ acpi_os_read_memory(acpi_physical_addres
 	if (unmap)
 		iounmap(virt_addr);
 	else
-		rcu_read_unlock();
+		read_unlock_irqrestore(&acpi_ioremaps_list_lock, flags);
 
 	return AE_OK;
 }
@@ -737,16 +750,21 @@ acpi_status
 acpi_os_write_memory(acpi_physical_address phys_addr, u64 value, u32 width)
 {
 	void __iomem *virt_addr;
+	unsigned long flags;
 	unsigned int size = width / 8;
 	bool unmap = false;
 
-	rcu_read_lock();
+	read_lock_irqsave(&acpi_ioremaps_list_lock, flags);
+
 	virt_addr = acpi_map_vaddr_lookup(phys_addr, size);
 	if (!virt_addr) {
-		rcu_read_unlock();
+
+		read_unlock_irqrestore(&acpi_ioremaps_list_lock, flags);
+
 		virt_addr = acpi_os_ioremap(phys_addr, size);
 		if (!virt_addr)
 			return AE_BAD_ADDRESS;
+
 		unmap = true;
 	}
 
@@ -770,7 +788,7 @@ acpi_os_write_memory(acpi_physical_addre
 	if (unmap)
 		iounmap(virt_addr);
 	else
-		rcu_read_unlock();
+		read_unlock_irqrestore(&acpi_ioremaps_list_lock, flags);
 
 	return AE_OK;
 }
On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki rjw@rjwysocki.net wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
The ACPI OS layer uses RCU to protect the list of ACPI memory mappings from being walked while it is updated. Among other situations, that list can be walked in non-NMI interrupt context, so using a sleeping lock to protect it is not an option.
However, performance issues related to the RCU usage in there appear, as described by Dan Williams:
"Recently a performance problem was reported for a process invoking a non-trival ASL program. The method call in this case ends up repetitively triggering a call path like:
acpi_ex_store acpi_ex_store_object_to_node acpi_ex_write_data_to_field acpi_ex_insert_into_field acpi_ex_write_with_update_rule acpi_ex_field_datum_io acpi_ex_access_region acpi_ev_address_space_dispatch acpi_ex_system_memory_space_handler acpi_os_map_cleanup.part.14 _synchronize_rcu_expedited.constprop.89 schedule
The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X."
In order to avoid these issues, replace the RCU in the ACPI OS layer by an rwlock.
That rwlock should not be frequently contended, so the performance impact of it is not expected to be significant.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Hi Dan,
This is a possible fix for the ACPI OSL RCU-related performance issues, but can you please arrange for the testing of it on the affected systems?
Ugh, is it really this simple? I did not realize the read-side is NMI safe. I'll take a look.
On Fri, Jun 5, 2020 at 7:09 PM Dan Williams dan.j.williams@intel.com wrote:
On Fri, Jun 5, 2020 at 7:06 AM Rafael J. Wysocki rjw@rjwysocki.net wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: [PATCH] ACPI: OSL: Use rwlock instead of RCU for memory management
The ACPI OS layer uses RCU to protect the list of ACPI memory mappings from being walked while it is updated. Among other situations, that list can be walked in non-NMI interrupt context, so using a sleeping lock to protect it is not an option.
However, performance issues related to the RCU usage in there appear, as described by Dan Williams:
"Recently a performance problem was reported for a process invoking a non-trival ASL program. The method call in this case ends up repetitively triggering a call path like:
acpi_ex_store acpi_ex_store_object_to_node acpi_ex_write_data_to_field acpi_ex_insert_into_field acpi_ex_write_with_update_rule acpi_ex_field_datum_io acpi_ex_access_region acpi_ev_address_space_dispatch acpi_ex_system_memory_space_handler acpi_os_map_cleanup.part.14 _synchronize_rcu_expedited.constprop.89 schedule
The end result of frequent synchronize_rcu_expedited() invocation is tiny sub-millisecond spurts of execution where the scheduler freely migrates this apparently sleepy task. The overhead of frequent scheduler invocation multiplies the execution time by a factor of 2-3X."
In order to avoid these issues, replace the RCU in the ACPI OS layer by an rwlock.
That rwlock should not be frequently contended, so the performance impact of it is not expected to be significant.
Reported-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Hi Dan,
This is a possible fix for the ACPI OSL RCU-related performance issues, but can you please arrange for the testing of it on the affected systems?
Ugh, is it really this simple? I did not realize the read-side is NMI safe. I'll take a look.
But if an NMI triggers while the lock is being held for writing, it will deadlock, won't it?
OTOH, according to the RCU documentation it is valid to call rcu_read_[un]lock() from an NMI handler (see Interrupts and NMIs in Documentation/RCU/Design/Requirements/Requirements.rst) so we are good from this perspective today.
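That is, using the names from the patch, this is a valid NMI-side lookup today:

	rcu_read_lock();
	virt_addr = acpi_map_vaddr_lookup(phys_addr, size);	/* RCU list walk */
	/* use the mapping while still inside the read-side critical section */
	rcu_read_unlock();

whereas

	read_lock_irqsave(&acpi_ioremaps_list_lock, flags);
	...
	read_unlock_irqrestore(&acpi_ioremaps_list_lock, flags);

spins forever if the NMI lands on a CPU that is in the middle of write_lock_irq() on the same lock, because disabling interrupts does not keep NMIs out.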
Unless we teach APEI to avoid mapping lookups from apei_{read|write}(), which wouldn't be unreasonable by itself, we need to hold on to the RCU in ACPI OSL, so it looks like addressing the problem in ACPICA is the best way to do it (and the current ACPICA code in question is suboptimal, so it would be good to rework it anyway).
Cheers!
On Fri, Jun 5, 2020 at 5:11 PM Rafael J. Wysocki rjw@rjwysocki.net wrote:
...
if (!refcount) {
write_lock_irq(&acpi_ioremaps_list_lock);
list_del(&map->list);
write_unlock_irq(&acpi_ioremaps_list_lock);
}
return refcount;
It seems we can decrease indentation level at the same time:
if (refcount)
	return refcount;

...
return 0;
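I.e. something like this (untested, just applying that to the patch above):

	static unsigned long acpi_os_drop_map_ref(struct acpi_ioremap *map)
	{
		unsigned long refcount = --map->refcount;

		if (refcount)
			return refcount;

		write_lock_irq(&acpi_ioremaps_list_lock);
		list_del(&map->list);
		write_unlock_irq(&acpi_ioremaps_list_lock);

		return 0;
	}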
On Fri, Jun 5, 2020 at 9:40 PM Andy Shevchenko andy.shevchenko@gmail.com wrote:
On Fri, Jun 5, 2020 at 5:11 PM Rafael J. Wysocki rjw@rjwysocki.net wrote:
...
if (!refcount) {
write_lock_irq(&acpi_ioremaps_list_lock);
list_del(&map->list);
write_unlock_irq(&acpi_ioremaps_list_lock);
}
return refcount;
It seems we can decrease indentation level at the same time:
if (refcount)
	return refcount;

...
return 0;
Right, but the patch will need to be dropped anyway I think.