Gregory's modest proposal to fix CXL cxl_mem_probe() failures due to delayed arrival of the CXL "root" infrastructure [1] prompted questions about how the existing mechanism for retrying cxl_mem_probe() could be failing.
The critical missing piece in the debug was that Gregory's setup had almost all CXL modules built-in to the kernel.
On the way to that discovery, several other bugs and init-order corner cases turned up.
The main fix is to make sure the drivers/cxl/Makefile object order supports root CXL ports being fully initialized upon cxl_acpi_probe() exit. The modular case has some similar potential holes that are fixed with MODULE_SOFTDEP() and other fixups. Finally, an attempt to update cxl_test to reproduce the original report resulted in the discovery of a separate long-standing use-after-free bug in cxl_region_detach().
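For background, MODULE_SOFTDEP() is the mechanism that closes the equivalent holes in the modular case; a minimal sketch of its use (illustrative only, the exact soft dependencies are in the patches themselves):

	/*
	 * Illustrative sketch only -- MODULE_SOFTDEP() records a hint that
	 * modprobe should load a prerequisite module first, closing the
	 * same kind of init-order hole in the modular case.
	 */
	#include <linux/module.h>

	static int __init example_driver_init(void)
	{
		return 0;
	}
	module_init(example_driver_init);

	MODULE_LICENSE("GPL");
	/* ask modprobe to load cxl_port before this (hypothetical) module */
	MODULE_SOFTDEP("pre: cxl_port");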
[1]: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
---
Dan Williams (5):
  cxl/port: Fix CXL port initialization order when the subsystem is built-in
  cxl/port: Fix cxl_bus_rescan() vs bus_rescan_devices()
  cxl/acpi: Ensure ports ready at cxl_acpi_probe() return
  cxl/port: Fix use-after-free, permit out-of-order decoder shutdown
  cxl/test: Improve init-order fidelity relative to real-world systems
 drivers/base/core.c          |  35 +++++++
 drivers/cxl/Kconfig          |   1
 drivers/cxl/Makefile         |  12 +--
 drivers/cxl/acpi.c           |   7 +
 drivers/cxl/core/hdm.c       |  50 +++++++++--
 drivers/cxl/core/port.c      |  13 ++-
 drivers/cxl/core/region.c    |  48 +++-------
 drivers/cxl/cxl.h            |   3 -
 include/linux/device.h       |   3 +
 tools/testing/cxl/test/cxl.c | 200 +++++++++++++++++++++++-------------------
 tools/testing/cxl/test/mem.c |   1
 11 files changed, 228 insertions(+), 145 deletions(-)
base-commit: 8cf0b93919e13d1e8d4466eb4080a4c4d9d66d7b
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs, and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upon cxl_acpi_probe() return. That in turn can only happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow for symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
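For reference, a config fragment along these lines (using the existing drivers/cxl Kconfig symbols) exercises the previously uncovered built-in arrangement:

	CONFIG_CXL_BUS=y
	CONFIG_CXL_PORT=y
	CONFIG_CXL_ACPI=y
	CONFIG_CXL_PCI=y
	CONFIG_CXL_MEM=y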
Reported-by: Gregory Price <gourry@gourry.net>
Closes: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net
Tested-by: Gregory Price <gourry@gourry.net>
Fixes: 8dd2bc0f8e02 ("cxl/mem: Add the cxl_mem driver")
Cc: stable@vger.kernel.org
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/cxl/Kconfig  |  1 +
 drivers/cxl/Makefile | 12 ++++++------
 2 files changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 29c192f20082..876469e23f7a 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -60,6 +60,7 @@ config CXL_ACPI
 	default CXL_BUS
 	select ACPI_TABLE_LIB
 	select ACPI_HMAT
+	select CXL_PORT
 	help
 	  Enable support for host managed device memory (HDM) resources
 	  published by a platform's ACPI CXL memory layout description.  See
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index db321f48ba52..374829359275 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -1,13 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y += core/
-obj-$(CONFIG_CXL_PCI) += cxl_pci.o
-obj-$(CONFIG_CXL_MEM) += cxl_mem.o
+obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
+obj-$(CONFIG_CXL_PCI) += cxl_pci.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
-obj-$(CONFIG_CXL_PORT) += cxl_port.o
+obj-$(CONFIG_CXL_MEM) += cxl_mem.o

-cxl_mem-y := mem.o
-cxl_pci-y := pci.o
+cxl_port-y := port.o
 cxl_acpi-y := acpi.o
+cxl_pci-y := pci.o
 cxl_pmem-y := pmem.o security.o
-cxl_port-y := port.o
+cxl_mem-y := mem.o
On Thu, 10 Oct 2024 22:34:02 -0700 Dan Williams <dan.j.williams@intel.com> wrote:
When the CXL subsystem is built-in the module init order is determined by Makefile order. That order violates expectations. The expectation is that cxl_acpi and cxl_mem can race to attach and that if cxl_acpi wins the race cxl_mem will find the enabled CXL root ports it needs and if cxl_acpi loses the race it will retrigger cxl_mem to attach via cxl_bus_rescan(). That only works if cxl_acpi can assume ports are enabled immediately upone cxl_acpi_probe() return. That in turn can only
upon
happen in the CONFIG_CXL_ACPI=y case if the cxl_port object appears before the cxl_acpi object in the Makefile.
Fix up the order to prevent initialization failures, and make sure that cxl_port is built-in if cxl_acpi is also built-in.
As for what contributed to this not being found earlier, the CXL regression environment, cxl_test, builds all CXL functionality as a module to allow to symbol mocking and other dynamic reload tests. As a result there is no regression coverage for the built-in case.
My testing is all modular too :(
[..]
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index db321f48ba52..374829359275 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -1,13 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y += core/
-obj-$(CONFIG_CXL_PCI) += cxl_pci.o
-obj-$(CONFIG_CXL_MEM) += cxl_mem.o
+obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
Needs some comments on the ordering being required. Otherwise some future 'cleanup' will reorder them again. However relying on build order is nasty.
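For example, such a comment could look something like this (hypothetical wording, not a patch from this thread):

	# Order is important here for the built-in case: cxl_port must init
	# before cxl_acpi so that root ports are enabled by the time
	# cxl_acpi_probe() returns, and cxl_mem comes last so that endpoint
	# probing finds the port topology already in place.
	obj-$(CONFIG_CXL_PORT) += cxl_port.o
	obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
	obj-$(CONFIG_CXL_PCI) += cxl_pci.o
	obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
	obj-$(CONFIG_CXL_MEM) += cxl_mem.o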
[..]
In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature:
   cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
   cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
   cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
   [..]
   cxld_unregister: cxl decoder14.0:
   cxl_region_decode_reset: cxl_region region3:
   mock_decoder_reset: cxl_port port3: decoder3.0 reset
2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
   cxl_endpoint_decoder_release: cxl decoder14.0:
   [..]
   cxld_unregister: cxl decoder7.0:
3) cxl_region_decode_reset: cxl_region region3:
   Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
   [..]
   RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
   [..]
   Call Trace:
    <TASK>
    cxl_region_decode_reset+0x69/0x190 [cxl_core]
    cxl_region_detach+0xe8/0x210 [cxl_core]
    cxl_decoder_kill_region+0x27/0x40 [cxl_core]
    cxld_unregister+0x5d/0x60 [cxl_core]
At 1) a region has been established with 2 endpoint decoders (7.0 and 14.0). Those endpoints share a common switch-decoder in the topology (3.0). At teardown, 2), decoder14.0 is the first to be removed and hits the "out of order reset case" in the switch decoder. The effect though is that region3 cleanup is aborted leaving it intact and referencing decoder14.0. At 3) the second attempt to tear down region3 trips over the stale decoder14.0 object which has long since been deleted.
The fix here is to recognize that the CXL specification places no mandate on in-order shutdown of switch-decoders, the driver enforces in-order allocation, and hardware enforces in-order commit. So, rather than fail and leave objects dangling, always remove them.
In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction.
A new device_for_each_child_reverse_from() is added to clean up port->commit_end after all dependent decoders have been disabled. In other words, if decoders are allocated 0->1->2 and disabled 1->2->0, then port->commit_end only decrements from 2 after 2 has been disabled, and it decrements all the way to zero since 1 was disabled previously.
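As a concrete walk-through of that sequence, here is a toy userspace model of the commit_end bookkeeping described above (illustrative sketch only, not kernel code):

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_DECODERS 3

	static bool enabled[NR_DECODERS];
	static int commit_end = -1;	/* highest committed decoder id */

	static void commit(int id)
	{
		enabled[id] = true;
		commit_end = id;
	}

	static void reset(int id)
	{
		enabled[id] = false;
		if (commit_end != id)
			return;	/* out-of-order: commit_end stays pinned */
		/* in-order: reap this decoder plus already-disabled ones below */
		while (commit_end >= 0 && !enabled[commit_end])
			commit_end--;
	}

	int main(void)
	{
		commit(0); commit(1); commit(2);	/* commit_end == 2 */
		reset(1); printf("%d\n", commit_end);	/* 2: pinned by decoder 2 */
		reset(2); printf("%d\n", commit_end);	/* 0: reaped 2, then 1 */
		reset(0); printf("%d\n", commit_end);	/* -1: fully reset */
		return 0;
	}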
Link: http://lore.kernel.org/20241004212504.1246-1-gourry@gourry.net [1]
Cc: stable@vger.kernel.org
Fixes: 176baefb2eb5 ("cxl/hdm: Commit decoder state to hardware")
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Alison Schofield <alison.schofield@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Zijun Hu <zijun_hu@icloud.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/base/core.c          | 35 +++++++++++++++++++++++++++++
 drivers/cxl/core/hdm.c       | 50 +++++++++++++++++++++++++++++++++++-------
 drivers/cxl/core/region.c    | 48 +++++++++++-----------------------------
 drivers/cxl/cxl.h            |  3 ++-
 include/linux/device.h       |  3 +++
 tools/testing/cxl/test/cxl.c | 14 ++++--------
 6 files changed, 100 insertions(+), 53 deletions(-)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index a4c853411a6b..e42f1ad73078 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -4037,6 +4037,41 @@ int device_for_each_child_reverse(struct device *parent, void *data,
 }
 EXPORT_SYMBOL_GPL(device_for_each_child_reverse);

+/**
+ * device_for_each_child_reverse_from - device child iterator in reversed order.
+ * @parent: parent struct device.
+ * @from: optional starting point in child list
+ * @fn: function to be called for each device.
+ * @data: data for the callback.
+ *
+ * Iterate over @parent's child devices, starting at @from, and call @fn
+ * for each, passing it @data. This helper is identical to
+ * device_for_each_child_reverse() when @from is NULL.
+ *
+ * @fn is checked each iteration. If it returns anything other than 0,
+ * iteration stop and that value is returned to the caller of
+ * device_for_each_child_reverse_from();
+ */
+int device_for_each_child_reverse_from(struct device *parent,
+				       struct device *from, const void *data,
+				       int (*fn)(struct device *, const void *))
+{
+	struct klist_iter i;
+	struct device *child;
+	int error = 0;
+
+	if (!parent->p)
+		return 0;
+
+	klist_iter_init_node(&parent->p->klist_children, &i,
+			     (from ? &from->p->knode_parent : NULL));
+	while ((child = prev_device(&i)) && !error)
+		error = fn(child, data);
+	klist_iter_exit(&i);
+	return error;
+}
+EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from);
+
 /**
  * device_find_child - device iterator for locating a particular device.
  * @parent: parent struct device
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3df10517a327..223c273c0cd1 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
 	return 0;
 }

-static int cxl_decoder_reset(struct cxl_decoder *cxld)
+static int commit_reap(struct device *dev, const void *data)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld;
+
+	if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev))
+		return 0;
+
+	cxld = to_cxl_decoder(dev);
+	if (port->commit_end == cxld->id &&
+	    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
+		port->commit_end--;
+		dev_dbg(&port->dev, "reap: %s commit_end: %d\n",
+			dev_name(&cxld->dev), port->commit_end);
+	}
+
+	return 0;
+}
+
+void cxl_port_commit_reap(struct cxl_decoder *cxld)
+{
+	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
+
+	lockdep_assert_held_write(&cxl_region_rwsem);
+
+	/*
+	 * Once the highest committed decoder is disabled, free any other
+	 * decoders that were pinned allocated by out-of-order release.
+	 */
+	port->commit_end--;
+	dev_dbg(&port->dev, "reap: %s commit_end: %d\n", dev_name(&cxld->dev),
+		port->commit_end);
+	device_for_each_child_reverse_from(&port->dev, &cxld->dev, NULL,
+					   commit_reap);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_commit_reap, CXL);
+
+static void cxl_decoder_reset(struct cxl_decoder *cxld)
 {
 	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
 	struct cxl_hdm *cxlhdm = dev_get_drvdata(&port->dev);
@@ -721,14 +758,14 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
 	u32 ctrl;

 	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
-		return 0;
+		return;

-	if (port->commit_end != id) {
+	if (port->commit_end == id)
+		cxl_port_commit_reap(cxld);
+	else
 		dev_dbg(&port->dev,
 			"%s: out of order reset, expected decoder%d.%d\n",
 			dev_name(&cxld->dev), port->id, port->commit_end);
-		return -EBUSY;
-	}

 	down_read(&cxl_dpa_rwsem);
 	ctrl = readl(hdm + CXL_HDM_DECODER0_CTRL_OFFSET(id));
@@ -741,7 +778,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
 	writel(0, hdm + CXL_HDM_DECODER0_BASE_LOW_OFFSET(id));
 	up_read(&cxl_dpa_rwsem);

-	port->commit_end--;
 	cxld->flags &= ~CXL_DECODER_F_ENABLE;

 	/* Userspace is now responsible for reconfiguring this decoder */
@@ -751,8 +787,6 @@ static int cxl_decoder_reset(struct cxl_decoder *cxld)
 		cxled = to_cxl_endpoint_decoder(&cxld->dev);
 		cxled->state = CXL_DECODER_STATE_MANUAL;
 	}
-
-	return 0;
 }

 static int cxl_setup_hdm_decoder_from_dvsec(
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e701e4b04032..3478d2058303 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -232,8 +232,8 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
 				"Bypassing cpu_cache_invalidate_memregion() for testing!\n");
 			return 0;
 		} else {
-			dev_err(&cxlr->dev,
-				"Failed to synchronize CPU cache state\n");
+			dev_WARN(&cxlr->dev,
+				 "Failed to synchronize CPU cache state\n");
 			return -ENXIO;
 		}
 	}
@@ -242,19 +242,17 @@ static int cxl_region_invalidate_memregion(struct cxl_region *cxlr)
 	return 0;
 }

-static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
+static void cxl_region_decode_reset(struct cxl_region *cxlr, int count)
 {
 	struct cxl_region_params *p = &cxlr->params;
-	int i, rc = 0;
+	int i;

 	/*
-	 * Before region teardown attempt to flush, and if the flush
-	 * fails cancel the region teardown for data consistency
-	 * concerns
+	 * Before region teardown attempt to flush, evict any data cached for
+	 * this region, or scream loudly about missing arch / platform support
+	 * for CXL teardown.
 	 */
-	rc = cxl_region_invalidate_memregion(cxlr);
-	if (rc)
-		return rc;
+	cxl_region_invalidate_memregion(cxlr);

 	for (i = count - 1; i >= 0; i--) {
 		struct cxl_endpoint_decoder *cxled = p->targets[i];
@@ -277,23 +275,17 @@ static int cxl_region_decode_reset(struct cxl_region *cxlr, int count)
 			cxl_rr = cxl_rr_load(iter, cxlr);
 			cxld = cxl_rr->decoder;
 			if (cxld->reset)
-				rc = cxld->reset(cxld);
-			if (rc)
-				return rc;
+				cxld->reset(cxld);
 			set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
 		}

 endpoint_reset:
-		rc = cxled->cxld.reset(&cxled->cxld);
-		if (rc)
-			return rc;
+		cxled->cxld.reset(&cxled->cxld);
 		set_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
 	}

 	/* all decoders associated with this region have been torn down */
 	clear_bit(CXL_REGION_F_NEEDS_RESET, &cxlr->flags);
-
-	return 0;
 }

 static int commit_decoder(struct cxl_decoder *cxld)
@@ -409,16 +401,8 @@ static ssize_t commit_store(struct device *dev, struct device_attribute *attr,
 	 * still pending.
 	 */
 	if (p->state == CXL_CONFIG_RESET_PENDING) {
-		rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
-		/*
-		 * Revert to committed since there may still be active
-		 * decoders associated with this region, or move forward
-		 * to active to mark the reset successful
-		 */
-		if (rc)
-			p->state = CXL_CONFIG_COMMIT;
-		else
-			p->state = CXL_CONFIG_ACTIVE;
+		cxl_region_decode_reset(cxlr, p->interleave_ways);
+		p->state = CXL_CONFIG_ACTIVE;
 	}
 }

@@ -2054,13 +2038,7 @@ static int cxl_region_detach(struct cxl_endpoint_decoder *cxled)
 	get_device(&cxlr->dev);

 	if (p->state > CXL_CONFIG_ACTIVE) {
-		/*
-		 * TODO: tear down all impacted regions if a device is
-		 * removed out of order
-		 */
-		rc = cxl_region_decode_reset(cxlr, p->interleave_ways);
-		if (rc)
-			goto out;
+		cxl_region_decode_reset(cxlr, p->interleave_ways);
 		p->state = CXL_CONFIG_ACTIVE;
 	}

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 0d8b810a51f0..5406e3ab3d4a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -359,7 +359,7 @@ struct cxl_decoder {
 	struct cxl_region *region;
 	unsigned long flags;
 	int (*commit)(struct cxl_decoder *cxld);
-	int (*reset)(struct cxl_decoder *cxld);
+	void (*reset)(struct cxl_decoder *cxld);
 };

 /*
@@ -730,6 +730,7 @@ static inline bool is_cxl_root(struct cxl_port *port)
 int cxl_num_decoders_committed(struct cxl_port *port);
 bool is_cxl_port(const struct device *dev);
 struct cxl_port *to_cxl_port(const struct device *dev);
+void cxl_port_commit_reap(struct cxl_decoder *cxld);
 struct pci_bus;
 int devm_cxl_register_pci_bus(struct device *host, struct device *uport_dev,
 			      struct pci_bus *bus);
diff --git a/include/linux/device.h b/include/linux/device.h
index b4bde8d22697..667cb6db9019 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1078,6 +1078,9 @@ int device_for_each_child(struct device *dev, void *data,
 			  int (*fn)(struct device *dev, void *data));
 int device_for_each_child_reverse(struct device *dev, void *data,
 				  int (*fn)(struct device *dev, void *data));
+int device_for_each_child_reverse_from(struct device *parent,
+				       struct device *from, const void *data,
+				       int (*fn)(struct device *, const void *));
 struct device *device_find_child(struct device *dev, void *data,
 				 int (*match)(struct device *dev, void *data));
 struct device *device_find_child_by_name(struct device *parent,
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 90d5afd52dd0..c5bbd89b3192 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld)
 	return 0;
 }

-static int mock_decoder_reset(struct cxl_decoder *cxld)
+static void mock_decoder_reset(struct cxl_decoder *cxld)
 {
 	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
 	int id = cxld->id;

 	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
-		return 0;
+		return;

 	dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev));
-	if (port->commit_end != id) {
+	if (port->commit_end == id)
+		cxl_port_commit_reap(cxld);
+	else
 		dev_dbg(&port->dev,
 			"%s: out of order reset, expected decoder%d.%d\n",
 			dev_name(&cxld->dev), port->id, port->commit_end);
-		return -EBUSY;
-	}
-
-	port->commit_end--;
 	cxld->flags &= ~CXL_DECODER_F_ENABLE;
-
-	return 0;
 }

 static void default_mock_decoder(struct cxl_decoder *cxld)
On 2024/10/11 13:34, Dan Williams wrote:
[..]
+int device_for_each_child_reverse_from(struct device *parent,
+				       struct device *from, const void *data,
+				       int (*fn)(struct device *, const void *))
[..]
+EXPORT_SYMBOL_GPL(device_for_each_child_reverse_from);
It does NOT deserve, and also does NOT need, the introduction of a new core driver API device_for_each_child_reverse_from(); the existing device_for_each_child_reverse() can do what the _from() wants to do.
We can use a similar approach to the one shown in the link below: https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@qui...
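For illustration, the kind of wrapper that approach implies might look like the following rough sketch (an assumption about the suggestion, not code from the linked series): emulate "start from @from" on top of the existing device_for_each_child_reverse() by skipping children until @from is seen.

	#include <linux/device.h>

	struct reverse_from_ctx {
		struct device *from;	/* NULL: behave like plain _reverse() */
		bool started;
		int (*fn)(struct device *dev, void *data);
		void *data;
	};

	static int reverse_from_helper(struct device *dev, void *data)
	{
		struct reverse_from_ctx *ctx = data;

		if (!ctx->started) {
			if (ctx->from && dev != ctx->from)
				return 0;	/* still scanning for the start point */
			ctx->started = true;
		}
		return ctx->fn(dev, ctx->data);
	}

	static int reverse_from(struct device *parent, struct device *from,
				void *data, int (*fn)(struct device *dev, void *data))
	{
		struct reverse_from_ctx ctx = { .from = from, .fn = fn, .data = data };

		return device_for_each_child_reverse(parent, &ctx, reverse_from_helper);
	}

Dan's reply below argues against pushing this bookkeeping onto callers.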
[..]
Zijun Hu wrote:
On 2024/10/11 13:34, Dan Williams wrote:
[..]
It does NOT deserve, and also does NOT need, the introduction of a new core driver API device_for_each_child_reverse_from(); the existing device_for_each_child_reverse() can do what the _from() wants to do.
We can use a similar approach to the one shown in the link below: https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@qui...
No, just have a simple starting point parameter. I understand that more logic can be placed around device_for_each_child_reverse() to achieve the same effect, but the core helpers should be removing logic from consumers, not forcing them to add more.
If bloat is a concern, then after your const cleanups go through, device_for_each_child_reverse() can be rewritten in terms of device_for_each_child_reverse_from() as (untested):
diff --git a/drivers/base/core.c b/drivers/base/core.c
index e42f1ad73078..2571c910da46 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -4007,36 +4007,6 @@ int device_for_each_child(struct device *parent, void *data,
 }
 EXPORT_SYMBOL_GPL(device_for_each_child);

-/**
- * device_for_each_child_reverse - device child iterator in reversed order.
- * @parent: parent struct device.
- * @fn: function to be called for each device.
- * @data: data for the callback.
- *
- * Iterate over @parent's child devices, and call @fn for each,
- * passing it @data.
- *
- * We check the return of @fn each time. If it returns anything
- * other than 0, we break out and return that value.
- */
-int device_for_each_child_reverse(struct device *parent, void *data,
-				  int (*fn)(struct device *dev, void *data))
-{
-	struct klist_iter i;
-	struct device *child;
-	int error = 0;
-
-	if (!parent || !parent->p)
-		return 0;
-
-	klist_iter_init(&parent->p->klist_children, &i);
-	while ((child = prev_device(&i)) && !error)
-		error = fn(child, data);
-	klist_iter_exit(&i);
-	return error;
-}
-EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
-
 /**
  * device_for_each_child_reverse_from - device child iterator in reversed order.
  * @parent: parent struct device.
diff --git a/include/linux/device.h b/include/linux/device.h
index 667cb6db9019..96a2c072bf5b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T))

 int device_for_each_child(struct device *dev, void *data,
 			  int (*fn)(struct device *dev, void *data));
-int device_for_each_child_reverse(struct device *dev, void *data,
-				  int (*fn)(struct device *dev, void *data));
 int device_for_each_child_reverse_from(struct device *parent,
 				       struct device *from, const void *data,
 				       int (*fn)(struct device *, const void *));
+static inline int device_for_each_child_reverse(struct device *dev, const void *data,
+						int (*fn)(struct device *, const void *))
+{
+	return device_for_each_child_reverse_from(dev, NULL, data, fn);
+}
 struct device *device_find_child(struct device *dev, void *data,
 				 int (*match)(struct device *dev, void *data));
 struct device *device_find_child_by_name(struct device *parent,
On 2024/10/12 01:46, Dan Williams wrote:
Zijun Hu wrote:
On 2024/10/11 13:34, Dan Williams wrote:
[..]
It does NOT deserve, and also does NOT need, the introduction of a new core driver API device_for_each_child_reverse_from(); the existing device_for_each_child_reverse() can do what the _from() wants to do.
We can use a similar approach to the one shown in the link below: https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@qui...
No, just have a simple starting point parameter. I understand that more logic can be placed around device_for_each_child_reverse() to achieve the same effect, but the core helpers should be removing logic from consumers, not forcing them to add more.
If bloat is a concern, then after your const cleanups go through, device_for_each_child_reverse() can be rewritten in terms of device_for_each_child_reverse_from() as (untested):
Bloat is one aspect; the other aspect is the redundancy between the two driver core APIs. Namely, there is a question:
why do we still need device_for_each_child_reverse() if it is the same as _from(..., NULL, ...)?
So I suggest using the existing API for now. If more users turn up with such a starting-point requirement, then add the parameter to device_for_each_child_reverse() itself, which would be consistent with other existing *_for_each_*() core APIs such as (class|driver|bus)_for_each_device() and bus_for_each_drv(); that may need more effort.
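For context, the prototypes being appealed to (approximately as declared in recent kernels; const-ness varies by version) all take a @start argument:

	int bus_for_each_dev(const struct bus_type *bus, struct device *start,
			     void *data, int (*fn)(struct device *dev, void *data));
	int bus_for_each_drv(const struct bus_type *bus, struct device_driver *start,
			     void *data, int (*fn)(struct device_driver *, void *));
	int driver_for_each_device(struct device_driver *drv, struct device *start,
				   void *data, int (*fn)(struct device *, void *));
	int class_for_each_device(const struct class *class, const struct device *start,
				  void *data, int (*fn)(struct device *, void *));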
Could you please include your proposal "fixing this allocation order validation" from the link below in this patch series, for the reasons below? And Cc me (^^)
https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.c...
A) The proposal depends on this patch series. B) One of the issues the proposal fixes is the match_free_decoder() error logic, which is also relevant to the issues this patch series fixes; the proposal can also fix the other device_find_child() match() issue I care about.
C) Actually, it is a bit difficult for me to understand the proposal since I don't have any basic knowledge about CXL. (^^)
[..]
On Sat, Oct 12, 2024 at 07:40:13AM +0800, Zijun Hu wrote:
On 2024/10/12 01:46, Dan Williams wrote:
Zijun Hu wrote:
On 2024/10/11 13:34, Dan Williams wrote:
[..]
It does NOT deserve, and also does NOT need, the introduction of a new core driver API device_for_each_child_reverse_from(); the existing device_for_each_child_reverse() can do what the _from() wants to do.
We can use a similar approach to the one shown in the link below: https://lore.kernel.org/all/20240815-const_dfc_prepare-v2-2-8316b87b8ff9@qui...
No, just have a simple starting point parameter. I understand that more logic can be placed around device_for_each_child_reverse() to achieve the same effect, but the core helpers should be removing logic from consumers, not forcing them to add more.
If bloat is a concern, then after your const cleanups go through device_for_each_child_reverse() can be rewritten in terms of device_for_each_child_reverse_from() as (untested):
Bloat is one aspect; the other aspect is that the two driver core APIs would be redundant with each other, namely, there is a question:
Why still keep device_for_each_child_reverse() if it is the same as _from(..., NULL, ...)?
This same pattern (_reverse and _from_reverse) is present in list.h and other iterators. Why would it be contentious here?
Reducing _reverse() to be a wrapper of _from_reverse is a nice way of reducing the bloat/redundancy without having to update every current user - this is a very common refactor pattern.
Refactoring without disrupting in-flight work is intrinsically valuable.
-/**
- * device_for_each_child_reverse - device child iterator in reversed order.
- * @parent: parent struct device.
- * @data: data for the callback.
- * @fn: function to be called for each device.
- *
- * Iterate over @parent's child devices, and call @fn for each,
- * passing it @data.
- *
- * We check the return of @fn each time. If it returns anything
- * other than 0, we break out and return that value.
- */
-int device_for_each_child_reverse(struct device *parent, void *data,
-				  int (*fn)(struct device *dev, void *data))
-{
-	struct klist_iter i;
-	struct device *child;
-	int error = 0;
-
-	if (!parent || !parent->p)
-		return 0;
-
-	klist_iter_init(&parent->p->klist_children, &i);
-	while ((child = prev_device(&i)) && !error)
-		error = fn(child, data);
-	klist_iter_exit(&i);
-	return error;
-}
-EXPORT_SYMBOL_GPL(device_for_each_child_reverse);
-
 /**
  * device_for_each_child_reverse_from - device child iterator in reversed order.
  * @parent: parent struct device.

diff --git a/include/linux/device.h b/include/linux/device.h
index 667cb6db9019..96a2c072bf5b 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1076,11 +1076,14 @@ DEFINE_FREE(device_del, struct device *, if (_T) device_del(_T))
 
 int device_for_each_child(struct device *dev, void *data,
 			  int (*fn)(struct device *dev, void *data));
-int device_for_each_child_reverse(struct device *dev, void *data,
-				  int (*fn)(struct device *dev, void *data));
 int device_for_each_child_reverse_from(struct device *parent,
 				       struct device *from, const void *data,
 				       int (*fn)(struct device *, const void *));
+static inline int device_for_each_child_reverse(struct device *dev, const void *data,
+						int (*fn)(struct device *, const void *))
+{
+	return device_for_each_child_reverse_from(dev, NULL, data, fn);
+}
 struct device *device_find_child(struct device *dev, void *data,
 				 int (*match)(struct device *dev, void *data));
 struct device *device_find_child_by_name(struct device *parent,
Zijun Hu wrote: [..]
Why still keep device_for_each_child_reverse() if it is the same as _from(..., NULL, ...)?
To allow access to the new functionality without burdening existing callers. With device_for_each_child_reverse() re-written to internally use device_for_each_child_reverse_from() Linux gets more functionality with no disruption to existing users and no bloat. This is typical API evolution.
So I suggest using the existing API for now.
No, I am failing to understand your concern.
If more users have such a starting-point requirement, then add the parameter to device_for_each_child_reverse(), which would be consistent with other existing *_for_each_*() core APIs such as (class|driver|bus)_for_each_device() and bus_for_each_drv(); that may need much effort.
There are ~370 existing device_for_each* callers. Let's not thrash them.
Introduce new superset calls with the additional parameter and then rewrite the old routines to just have a simple wrapper that passes a NULL @start parameter.
Now, if Greg has the appetite to go touch all ~370 existing callers, so be it, but introducing a superset-iterator-helper and a compat-wrapper for legacy is the path I would take.
Could you please include your proposal "fixing this allocation order validation" from the link below in this patch series, for the reasons below? And Cc me (^^)
https://lore.kernel.org/all/670835f5a2887_964f229474@dwillia2-xfh.jf.intel.c...
A) The proposal depends on this patch series. B) One of the issues the proposal fixes is the match_free_decoder() error logic, which is also relevant to the issues this patch series fixes; the proposal can also fix the other device_find_child() match() issue I care about.
C) Actually, it is a bit difficult for me to understand the proposal since I don't have any basic knowledge of CXL. (^^)
If I understand your question correctly you are asking how does device_for_each_child_reverse_from() get used in cxl_region_find_decoder() to enforce in-order allocation?
The recommendation is the following:
-- 8< --
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index 3478d2058303..32cde18ff31b 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -778,26 +778,50 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos)
 	return rc;
 }
 
+static int check_commit_order(struct device *dev, const void *data)
+{
+	struct cxl_decoder *cxld = to_cxl_decoder(dev);
+
+	/*
+	 * if port->commit_end is not the only free decoder, then out of
+	 * order shutdown has occurred, block further allocations until
+	 * that is resolved
+	 */
+	if (((cxld->flags & CXL_DECODER_F_ENABLE) == 0))
+		return -EBUSY;
+	return 0;
+}
+
 static int match_free_decoder(struct device *dev, void *data)
 {
+	struct cxl_port *port = to_cxl_port(dev->parent);
 	struct cxl_decoder *cxld;
-	int *id = data;
+	int rc;
 
 	if (!is_switch_decoder(dev))
 		return 0;
 
 	cxld = to_cxl_decoder(dev);
 
-	/* enforce ordered allocation */
-	if (cxld->id != *id)
+	if (cxld->id != port->commit_end + 1)
 		return 0;
 
-	if (!cxld->region)
-		return 1;
-
-	(*id)++;
+	if (cxld->region) {
+		dev_dbg(dev->parent,
+			"next decoder to commit (%s) is already reserved\n",
+			dev_name(dev));
+		return 0;
+	}
 
-	return 0;
+	rc = device_for_each_child_reverse_from(dev->parent, dev, NULL,
+						check_commit_order);
+	if (rc) {
+		dev_dbg(dev->parent,
+			"unable to allocate %s due to out of order shutdown\n",
+			dev_name(dev));
+		return 0;
+	}
+	return 1;
 }
 
 static int match_auto_decoder(struct device *dev, void *data)
@@ -824,7 +848,6 @@ cxl_region_find_decoder(struct cxl_port *port,
 			struct cxl_region *cxlr)
 {
 	struct device *dev;
-	int id = 0;
 
 	if (port == cxled_to_port(cxled))
 		return &cxled->cxld;
@@ -833,7 +856,7 @@ cxl_region_find_decoder(struct cxl_port *port,
 		dev = device_find_child(&port->dev, &cxlr->params,
 					match_auto_decoder);
 	else
-		dev = device_find_child(&port->dev, &id, match_free_decoder);
+		dev = device_find_child(&port->dev, NULL, match_free_decoder);
 	if (!dev)
 		return NULL;
 	/*
On 2024/10/13 06:16, Dan Williams wrote:
Zijun Hu wrote: [..]
The current kernel tree ONLY has 15 usages of device_for_each_child_reverse(); I would like to add an extra @start parameter, as the existing (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do, if it is required.
If I understand your question correctly you are asking how does device_for_each_child_reverse_from() get used in cxl_region_find_decoder() to enforce in-order allocation?
Yes. Your recommendation may help me understand it.
The recommendation is the following: [..]

+	if (cxld->id != port->commit_end + 1)
 		return 0;
For a port, is it possible that there are multiple CXLDs with the same ID?
Zijun Hu wrote:
On 2024/10/13 06:16, Dan Williams wrote:
[..]
The current kernel tree ONLY has 15 usages of device_for_each_child_reverse(); I would like to add an extra @start parameter, as the existing (class|driver)_for_each_device() and bus_for_each_(dev|drv)() do, if it is required.
A new parameter to a new wrapper symbol sounds fine to me. Otherwise, please do not go thrash all the call sites to pass an unused NULL @start parameter. Just accept that device_for_each_* did not follow the {class,driver,bus}_for_each_* example and instead introduce a new symbol to wrap the new functionality that so far only has the single CXL user.
[..]
For a port, is it possible that there are multiple CXLDs with the same ID?
Not possible; that is also enforced by the driver core: all kernel object names must be unique (at least if they have the same parent), and the local subsystem convention is that all decoders are registered in id-order.
On 2024/10/15 03:32, Dan Williams wrote:
Zijun Hu wrote:
[..]
A new parameter to a new wrapper symbol sounds fine to me. Otherwise, please do not go thrash all the call sites to pass an unused NULL @start parameter. Just accept that device_for_each_* did not follow the {class,driver,bus}_for_each_* example and instead introduce a new symbol to wrap the new functionality that so far only has the single CXL user.
You may regard my idea as an alternative proposal if Greg dislikes introducing a new core driver API. (^^)
[..]
The simple solution below should have the same effect as your recommendation, and it has these optimizations: 1) it doesn't need a new core API; 2) it is more efficient since it iterates minimally.
I will submit it if you like it. (^^)
index e701e4b04032..37da42329ff3 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -796,8 +796,9 @@ static size_t show_targetN(struct cxl_region *cxlr, char *buf, int pos)
 
 static int match_free_decoder(struct device *dev, void *data)
 {
+	struct cxl_port *port = to_cxl_port(dev->parent);
 	struct cxl_decoder *cxld;
-	int *id = data;
+	struct device **target_dev = data;
 
 	if (!is_switch_decoder(dev))
 		return 0;
@@ -805,15 +806,19 @@ static int match_free_decoder(struct device *dev, void *data)
 	cxld = to_cxl_decoder(dev);
 
 	/* enforce ordered allocation */
-	if (cxld->id != *id)
-		return 0;
-
-	if (!cxld->region)
-		return 1;
-
-	(*id)++;
-
-	return 0;
+	if (cxld->id == port->commit_end + 1) {
+		if (!cxld->region) {
+			*target_dev = dev;
+			return 1;
+		} else {
+			dev_dbg(dev->parent,
+				"next decoder to commit (%s) is already reserved\n",
+				dev_name(dev));
+			return -ENODEV;
+		}
+	} else {
+		return cxld->flags & CXL_DECODER_F_ENABLE ? 0 : -EBUSY;
+	}
 }
 
 static int match_auto_decoder(struct device *dev, void *data)
@@ -839,7 +844,7 @@ cxl_region_find_decoder(struct cxl_port *port,
 			struct cxl_endpoint_decoder *cxled,
 			struct cxl_region *cxlr)
 {
-	struct device *dev;
+	struct device *dev = NULL;
 	int id = 0;
 
 	if (port == cxled_to_port(cxled))
@@ -848,8 +853,8 @@ cxl_region_find_decoder(struct cxl_port *port,
 	if (test_bit(CXL_REGION_F_AUTO, &cxlr->flags))
 		dev = device_find_child(&port->dev, &cxlr->params,
 					match_auto_decoder);
-	else
-		dev = device_find_child(&port->dev, &id, match_free_decoder);
+	else if (device_for_each_child(&port->dev, &dev, match_free_decoder) > 0)
+		get_device(dev);
 	if (!dev)
 		return NULL;
 	/*
Zijun Hu wrote: [..]
You may regard my idea as an alternative proposal if Greg dislikes introducing a new core driver API. (^^)
If the proposal is to add an unwanted parameter to existing call sites then yes, I would NAK that.
[..]
The simple solution below should have the same effect as your recommendation, and it has these optimizations: 1) it doesn't need a new core API; 2) it is more efficient since it iterates minimally. I will submit it if you like it. (^^)
See the patch I just submitted, it does not handle the case of competing allocations. The cxld->region check is not sufficient for determining that the decoder is committed.
On Thu, 10 Oct 2024 22:34:26 -0700 Dan Williams dan.j.williams@intel.com wrote:
In support of investigating an initialization failure report [1], cxl_test was updated to register mock memory-devices after the mock root-port/bus device had been registered. That led to cxl_test crashing with a use-after-free bug with the following signature:
 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem0:decoder7.0 @ 0 next: cxl_switch_uport.0 nr_eps: 1 nr_targets: 1
 cxl_port_attach_region: cxl region3: cxl_host_bridge.0:port3 decoder3.0 add: mem4:decoder14.0 @ 1 next: cxl_switch_uport.0 nr_eps: 2 nr_targets: 1
 cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[0] = cxl_switch_dport.0 for mem0:decoder7.0 @ 0
1) cxl_port_setup_targets: cxl region3: cxl_switch_uport.0:port6 target[1] = cxl_switch_dport.4 for mem4:decoder14.0 @ 1
[..]
 cxld_unregister: cxl decoder14.0:
 cxl_region_decode_reset: cxl_region region3:
 mock_decoder_reset: cxl_port port3: decoder3.0 reset
2) mock_decoder_reset: cxl_port port3: decoder3.0: out of order reset, expected decoder3.1
 cxl_endpoint_decoder_release: cxl decoder14.0:
[..]
 cxld_unregister: cxl decoder7.0:
3) cxl_region_decode_reset: cxl_region region3:
 Oops: general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bc3: 0000 [#1] PREEMPT SMP PTI
[..]
 RIP: 0010:to_cxl_port+0x8/0x60 [cxl_core]
[..]
 Call Trace:
  <TASK>
  cxl_region_decode_reset+0x69/0x190 [cxl_core]
  cxl_region_detach+0xe8/0x210 [cxl_core]
  cxl_decoder_kill_region+0x27/0x40 [cxl_core]
  cxld_unregister+0x5d/0x60 [cxl_core]
[..]
In support of making cxl_region_decode_reset() always succeed, cxl_region_invalidate_memregion() failures are turned into warnings. Crashing the kernel is ok there since system integrity is at risk if caches cannot be managed around physical address mutation events like CXL region destruction.
I'm fine with this, but it seems worth breaking this out as a precursor patch where we can discuss the merits of that change separately from the complexity of the rest.
I don't mind that strongly though, so if you keep this intact,

Reviewed-by: Jonathan Cameron Jonathan.Cameron@huawei.com
Trivial passing comment inline.
diff --git a/drivers/cxl/core/hdm.c b/drivers/cxl/core/hdm.c
index 3df10517a327..223c273c0cd1 100644
--- a/drivers/cxl/core/hdm.c
+++ b/drivers/cxl/core/hdm.c
@@ -712,7 +712,44 @@ static int cxl_decoder_commit(struct cxl_decoder *cxld)
 	return 0;
 }
 
-static int cxl_decoder_reset(struct cxl_decoder *cxld)
+static int commit_reap(struct device *dev, const void *data)
+{
+	struct cxl_port *port = to_cxl_port(dev->parent);
+	struct cxl_decoder *cxld;
+
+	if (!is_switch_decoder(dev) && !is_endpoint_decoder(dev))
+		return 0;
+
+	cxld = to_cxl_decoder(dev);
+	if (port->commit_end == cxld->id &&
+	    ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)) {
I'd have gone with !(cxld->flags & CXL_DECODER_F_ENABLE) but this is consistent with existing form, so fine as is.
+		port->commit_end--;
+		dev_dbg(&port->dev, "reap: %s commit_end: %d\n",
+			dev_name(&cxld->dev), port->commit_end);
+	}
+
+	return 0;
+}
diff --git a/tools/testing/cxl/test/cxl.c b/tools/testing/cxl/test/cxl.c
index 90d5afd52dd0..c5bbd89b3192 100644
--- a/tools/testing/cxl/test/cxl.c
+++ b/tools/testing/cxl/test/cxl.c
@@ -693,26 +693,22 @@ static int mock_decoder_commit(struct cxl_decoder *cxld)
 	return 0;
 }
 
-static int mock_decoder_reset(struct cxl_decoder *cxld)
+static void mock_decoder_reset(struct cxl_decoder *cxld)
 {
 	struct cxl_port *port = to_cxl_port(cxld->dev.parent);
 	int id = cxld->id;
 
 	if ((cxld->flags & CXL_DECODER_F_ENABLE) == 0)
-		return 0;
+		return;
 
 	dev_dbg(&port->dev, "%s reset\n", dev_name(&cxld->dev));
-	if (port->commit_end != id) {
+	if (port->commit_end == id)
+		cxl_port_commit_reap(cxld);
+	else
 		dev_dbg(&port->dev,
 			"%s: out of order reset, expected decoder%d.%d\n",
 			dev_name(&cxld->dev), port->id, port->commit_end);
-		return -EBUSY;
-	}
-
-	port->commit_end--;
 	cxld->flags &= ~CXL_DECODER_F_ENABLE;
-
-	return 0;
 }
 
 static void default_mock_decoder(struct cxl_decoder *cxld)
Jonathan Cameron wrote:
On Thu, 10 Oct 2024 22:34:26 -0700 Dan Williams dan.j.williams@intel.com wrote:
[..]
I'm fine with this, but it seems worth breaking this out as a precursor patch where we can discuss the merits of that change separately from the complexity of the rest.
I don't mind that strongly though so if you keep this intact,
If there are merits to discuss, let's discuss them in this patch (in v2), because if cxl_region_invalidate_memregion() failures are not suitable to be warnings then that invalidates this patch.
[..]
I'd have gone with !(cxld->flags & CXL_DECODER_F_ENABLE) but this is consistent with existing form, so fine as is.
I have long had an aversion to negation operators for the small speedbump to left-to-right readability, as evidenced by all the other "(cxld->flags & CXL_DECODER_F_ENABLE) == 0" instances in drivers/cxl/.
Hi Dan,
I think this is the same issue that one of the patches in the type2 support series tries to deal with:
https://lore.kernel.org/linux-cxl/20240907081836.5801-1-alejandro.lucero-pal...
If this fixes that situation, I guess I can drop that one from v4 which is ready to be sent.
As for the other problem I try to fix in that patch, the endpoint not being there when that code tries to use it, that fix is likely not needed either, although I now have a trivial one instead of that ugly loop with delays. The solution is to add PROBE_FORCE_SYNCHRONOUS as the probe_type for the cxl_mem_driver, which implies device_add() will only return when the device is really created. Maybe that is worth it for other potential situations suffering from delayed creation.
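For reference, probe_type is a field of struct device_driver, so forcing synchronous probe is a one-line driver setting; a minimal sketch with a hypothetical platform driver (the real cxl_mem driver hangs off the CXL bus, so the struct there differs):

#include <linux/module.h>
#include <linux/platform_device.h>

static int demo_probe(struct platform_device *pdev)
{
	return 0;
}

static struct platform_driver demo_driver = {
	.probe = demo_probe,
	.driver = {
		.name = "demo",
		/* never defer probing to the async machinery */
		.probe_type = PROBE_FORCE_SYNCHRONOUS,
	},
};
module_platform_driver(demo_driver);
MODULE_DESCRIPTION("probe_type illustration");
MODULE_LICENSE("GPL");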
On 10/11/24 06:33, Dan Williams wrote:
[..]
Alejandro Lucero Palau wrote:
[..]
I am skeptical that PROBE_FORCE_SYNCHRONOUS is a fix for any device-readiness bug. Some other assumption is violated if that is required.
For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the assumption that an accelerator driver might want to wait until CXL is initialized before the base accelerator proceeds. However, if accelerator drivers behave the same as the cxl_pci driver and are ok with asynchronous arrival of CXL functionality then no deferral is needed.
Otherwise, the only motivation for synchronous probing I can think of would be to have more predictable naming of kernel objects. So yes, I would be curious to understand what scenarios probe deferral is still needed.
On 10/11/24 18:38, Dan Williams wrote:
[..]
I am skeptical that PROBE_FORCE_SYNCHRONOUS is a fix for any device-readiness bug. Some other assumption is violated if that is required.
But that problem is not about device readiness, just about how the device model works. In this case the memdev creation is adding devices, not real ones but the abstractions we use from the device model, and that device creation is done asynchronously. When the code creating the memdev, a Type2 driver in my case, goes to work with such a device abstraction just after the memdev creation, it is not there yet. It is true that cxl_mem_probe will interact with the real device, but the fact is that such a function is not invoked by the device model synchronously, so the code using such a device abstraction needs to wait until it is there. With this flag the waiting is implicit in device creation. Without that flag, other "nasty dancing" with delays and checks needs to be done, as the code in v3 did.
For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the assumption that an accelerator driver might want to wait until CXL is initialized before the base accelerator proceeds. However, if accelerator drivers behave the same as the cxl_pci driver and are ok with asynchronous arrival of CXL functionality then no deferral is needed.
I think deferring the accel driver makes sense. In the sfc driver case, it could work without CXL and then change to use it once the CXL kernel support is fully initialised, but I guess other accel drivers will rely on CXL with no other option, and even with the sfc driver, supporting such a change will make the code far more complex.
Otherwise, the only motivation for synchronous probing I can think of would be to have more predictable naming of kernel objects. So yes, I would be curious to understand what scenarios probe deferral is still needed.
OK. I will keep that patch with the last change in the v4. Let's discuss this further with that patch as a reference.
Thanks.
Alejandro Lucero Palau wrote: [..]
I am skeptical that PROBE_FORCE_SYNCHRONOUS is a fix for any device-readiness bug. Some other assumption is violated if that is required.
But that problem is not about device readiness but just how the device model works. In this case the memdev creation is adding devices, no real ones but those abstractions we use from the device model, and that device creation is done asynchronously.
Device creation is not done asynchronously, the PCI driver is attaching asynchronously. When the PCI driver attaches it creates memdevs and those are attached to cxl_mem synchronously.
memdev, a Type2 driver in my case, is going to work with such a device abstraction just after the memdev creation, it is not there yet.
Oh, is the concern that you always want to have the memdev attached to cxl_mem immediately after it is registered?
I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is needed. However, to fix this situation once and for all I think I would rather just drop all this modularity and move both cxl_port and cxl_mem to be drivers internal to cxl_core.ko similar to the cxl_region driver.
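For reference, such a soft dependency is a single declaration next to the module's other MODULE_* tags; a minimal sketch:

#include <linux/module.h>

/*
 * Hint to modprobe: load cxl_mem before this module so that memdevs
 * registered here find the cxl_mem driver already present. Note that
 * softdeps only influence modprobe; they do not help built-in code.
 */
MODULE_SOFTDEP("pre: cxl_mem");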
It is true that cxl_mem_probe will interact with the real device, but the fact is such a function is not invoked by the device model synchronously, so the code using such a device abstraction needs to wait until it is there. With this flag the waiting is implicit to device creation. Without that flag other "nasty dancing" with delays and checks needs to be done as the code in v3 did.
It is invoked synchronously *if* the driver is loaded *and* the user has not forced asynchronous attach on the command line, in which case they get to keep the pieces.
For the type-2 case I did have an EPROBE_DEFER in my initial RFC on the assumption that an accelerator driver might want to wait until CXL is initialized before the base accelerator proceeds. However, if accelerator drivers behave the same as the cxl_pci driver and are ok with asynchronous arrival of CXL functionality then no deferral is needed.
I think deferring the accel driver makes sense. In the sfc driver case, it could work without CXL and then change to use it once the CXL kernel support is fully initialised, but I guess other accel drivers will rely on CXL with no other option, and even with the sfc driver, supporting such a change will make the code far more complex.
Makes sense.
Otherwise, the only motivation for synchronous probing I can think of would be to have more predictable naming of kernel objects. So yes, I would be curious to understand what scenarios probe deferral is still needed.
OK. I will keep that patch with the last change in the v4. Let's discuss this further with that patch as a reference.
EPROBE_DEFER when CXL is not ready yet is fine in the sfc driver, just comment that the driver does not have a PCIe-only operation mode and requires that CXL initializes.
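A hedged sketch of that suggestion, with cxl_root_ready() as a hypothetical stand-in for whatever check the sfc driver uses to detect that the CXL root has arrived:

#include <linux/errno.h>
#include <linux/pci.h>

/* hypothetical placeholder for the driver's actual readiness check */
static bool cxl_root_ready(struct pci_dev *pdev)
{
	return false;
}

static int sfc_cxl_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	/*
	 * No PCIe-only operation mode: defer until the CXL root
	 * infrastructure has initialized, then probe for real.
	 */
	if (!cxl_root_ready(pdev))
		return -EPROBE_DEFER;

	/* ... CXL-dependent initialization ... */
	return 0;
}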
On 10/12/24 22:57, Dan Williams wrote:
[..]
I think that is another case where "MODULE_SOFTDEP("pre: cxl_mem")" is needed. However, to fix this situation once and for all I think I would rather just drop all this modularity and move both cxl_port and cxl_mem to be drivers internal to cxl_core.ko similar to the cxl_region driver.
Oh, so the problem is the code is not ready because the functionality is in a module not loaded yet.
Then that change makes sense. I'll do it if not already taken. I'll send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the previous loop with delays implemented in v3.
Thanks
Alejandro Lucero Palau wrote:
[..]
Oh, so the problem is the code is not ready because the functionality is in a module not loaded yet.
Right.
Then that change makes sense. I'll do it if not already taken. I'll send v4 without the PROBE_FORCE_SYNCHRONOUS flag and without the previous loop with delays implemented in v3.
So I think EPROBE_DEFER can stay out of the CXL core because it is up to the accelerator driver to decide whether CXL availability is fatal to init or not.
Additionally, I am less and less convinced that Type-2 drivers should be forced to depend on the cxl_mem driver to attach vs just arranging for those Type-2 drivers to call devm_cxl_enumerate_ports() and devm_cxl_add_endpoint() directly. In other words I am starting to worry that the generic cxl_mem driver design pattern is a midlayer mistake.
So, if it makes it easier for sfc, I would be open to exploring a direct scheme for endpoint attachment, and not requiring the generic cxl_mem driver as an intermediary.
On 10/14/24 23:24, Dan Williams wrote:
[..]
So I think EPROBE_DEFER can stay out of the CXL core because it is up to the accelerator driver to decide whether CXL availability is fatal to init or not.
It needs support from the cxl core though. If the cxl root is not there yet, the driver needs to know, and that is what you did in your original patch and I'm keeping as well.
Additionally, I am less and less convinced that Type-2 drivers should be forced to depend on the cxl_mem driver to attach vs just arranging for those Type-2 drivers to call devm_cxl_enumerate_ports() and devm_cxl_add_endpoint() directly. In other words I am starting to worry that the generic cxl_mem driver design pattern is a midlayer mistake.
You know better than me, but in my view a Type2 should follow what a Type3 does, with some small changes for dealing with the differences, mainly the way it is going to be used and those optional capabilities for Type2. This makes more sense to me for Type1.
So, if it makes it easier for sfc, I would be open to exploring a direct scheme for endpoint attachment, and not requiring the generic cxl_mem driver as an intermediary.
V4 is ready (I'm just having problems when testing with 6.12-rcX), so I would like to keep it and explore this once we have something working and accepted. Type2 and Type1 with CXL cache will bring new challenges, and I bet we will need refactoring in the code and potentially a new design for the generic code (Type3 and Type2, Type2 and Type1).
Alejandro Lucero Palau wrote: [..]
[..]
It needs support from the cxl core though. If the cxl root is not there yet, the driver needs to know, and that is what you did in your original patch and I'm keeping as well.
So there are two ways: check if a registered @memdev has @memdev->dev.driver set, assuming you know that the cxl_mem driver is available, or call devm_cxl_enumerate_ports() yourself and skip the cxl_mem driver indirection.
Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally had, is an even more indirect way to convey a similar result and is starting to feel a bit "mid-layer-y".
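A minimal sketch of the first option (assumes struct cxl_memdev from drivers/cxl/cxlmem.h; the unlocked read is racy and only illustrates the idea):

/* has the cxl_mem driver bound to this memdev yet? */
static bool memdev_attached(struct cxl_memdev *cxlmd)
{
	return cxlmd->dev.driver != NULL;
}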
[..]
If an accelerator driver *wants* to depend on cxl_mem, it can, but all the cxl_core functions that cxl_mem uses are also exported for accelerator drivers to use directly to avoid needing to guess about when / if cxl_mem is going to attach.
[..]
Yeah, no need to switch horses mid-race if you already have a cxl_mem dependent approach nearly ready, but a potential simplification to explore going forward.
On 10/15/24 17:37, Dan Williams wrote:
[..]
Setting @memdev->endpoint to ERR_PTR(-EPROBE_DEFER), as I originally had, is an even more indirect way to convey a similar result and is starting to feel a bit "mid-layer-y".
I was a bit confused by this answer until I read the patch commit message from your original work again.
My confusion came from assuming that if the root device is not there, it is because the hardware root initialization requires more time. But I realize now that you specifically said "the root driver has not attached yet", which turns this into the problem of kernel modules not being loaded yet.
If so, I think I can solve this within the type2 driver code and Kconfig. Kconfig will force the driver to be compiled as a module if the cxl core is a module, and with MODULE_SOFTDEP("pre: cxl_core cxl_port cxl_acpi cxl-mem") the type2 driver's modprobe will trigger loading of the dependencies beforehand. This makes EPROBE_DEFER unnecessary.
What do you think?
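For illustration, the Kconfig half of that plan could look like the sketch below; the symbol name is hypothetical, and the key property is that a tristate 'depends on CXL_BUS' limits the driver to =m whenever the CXL core is =m:

config SFC_CXL_TYPE2
	tristate "Hypothetical type2 accelerator driver"
	depends on CXL_BUS
	help
	  Because of the tristate dependency, this driver can only be
	  built-in when the CXL core is built-in, and is forced to be
	  a module when CXL_BUS=m.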
Alejandro Lucero Palau wrote: [..]
[..]
If so, I think I can solve this within the type2 driver code and kconfig. Kconfig will force the driver being compiled as a module...
There should be no requirement that accelerator drivers must be built as modules. An accelerator driver simply cannot enforce, via module load order, that CXL root infrastructure is up and ready before the accelerator 'probe' routine runs. This is because enumeration order still dominates and enumeration order is effectively random*.
The accelerator driver only has 2 options, return EPROBE_DEFER until all resource dependencies are ready, or do what cxl_pci + cxl_mem do. What cxl_pci + cxl_mem do is, cxl_pci_probe() registers a memdev and then at some point later cxl_mem notices that the root infrastructure has arrived via the cxl_bus_rescan() event.
Note that these patches are about fixing the assumptions of cxl_bus_rescan(), not about ensuring init order.
* ...at least nothing should break if CXL root and CXL endpoint enumeration happens out of order.