Linaro-mm-sig November 2024

linaro-mm-sig@lists.linaro.org

8 participants
15 discussions

Re: [PATCH v2 RESEND 3/3] i2c: i2c-qcom-geni: Add Block event interrupt support

by Bjorn Andersson

On Mon, Nov 11, 2024 at 07:32:44PM +0530, Jyothi Kumar Seerapu wrote: > The I2C driver gets an interrupt upon transfer completion. > For multiple messages in a single transfer, N interrupts will be > received for N messages, leading to significant software interrupt > latency. To mitigate this latency, utilize Block Event Interrupt (BEI) Please rewrite this to the tone that the reader doesn't know what Block Event Interrupt is, or that it exists. > only when an interrupt is necessary. This means large transfers can be > split into multiple chunks of 8 messages internally, without expecting > interrupts for the first 7 message completions, only the last one will > trigger an interrupt indicating 8 messages completed. > > By implementing BEI, multi-message transfers can be divided into > chunks of 8 messages, improving overall transfer time. You already wrote this in the paragraph above. Where is this number 8 coming from btw? > This optimization reduces transfer time from 168 ms to 48 ms for a > series of 200 I2C write messages in a single transfer, with a > clock frequency support of 100 kHz. > > BEI optimizations are currently implemented for I2C write transfers only, > as there is no use case for multiple I2C read messages in a single transfer > at this time. > > Signed-off-by: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com> > --- > > v1 -> v2: > - Moved gi2c_gpi_xfer->msg_idx_cnt to separate local variable. > - Updated goto labels for error scenarios in geni_i2c_gpi function > - memset tx_multi_xfer to 0. > - Removed passing current msg index to geni_i2c_gpi. > - Fixed kernel test robot reported compilation issues. 
> > drivers/i2c/busses/i2c-qcom-geni.c | 203 +++++++++++++++++++++++++---- > 1 file changed, 178 insertions(+), 25 deletions(-) > > diff --git a/drivers/i2c/busses/i2c-qcom-geni.c b/drivers/i2c/busses/i2c-qcom-geni.c > index 7a22e1f46e60..04a7d926dadc 100644 > --- a/drivers/i2c/busses/i2c-qcom-geni.c > +++ b/drivers/i2c/busses/i2c-qcom-geni.c > @@ -100,6 +100,10 @@ struct geni_i2c_dev { > struct dma_chan *rx_c; > bool gpi_mode; > bool abort_done; > + bool is_tx_multi_xfer; > + u32 num_msgs; > + u32 tx_irq_cnt; > + struct gpi_i2c_config *gpi_config; > }; > > struct geni_i2c_desc { > @@ -500,6 +504,7 @@ static int geni_i2c_tx_one_msg(struct geni_i2c_dev *gi2c, struct i2c_msg *msg, > static void i2c_gpi_cb_result(void *cb, const struct dmaengine_result *result) > { > struct geni_i2c_dev *gi2c = cb; > + struct gpi_multi_xfer *tx_multi_xfer; > > if (result->result != DMA_TRANS_NOERROR) { > dev_err(gi2c->se.dev, "DMA txn failed:%d\n", result->result); > @@ -508,7 +513,21 @@ static void i2c_gpi_cb_result(void *cb, const struct dmaengine_result *result) > dev_dbg(gi2c->se.dev, "DMA xfer has pending: %d\n", result->residue); > } > > - complete(&gi2c->done); > + if (gi2c->is_tx_multi_xfer) { Wouldn't it be cleaner to treat the !is_tx_multi_xfer case as a multi-xfer of length 1? > + tx_multi_xfer = &gi2c->gpi_config->multi_xfer; > + > + /* > + * Send Completion for last message or multiple of NUM_MSGS_PER_IRQ. > + */ > + if ((tx_multi_xfer->irq_msg_cnt == gi2c->num_msgs - 1) || > + (!((tx_multi_xfer->irq_msg_cnt + 1) % NUM_MSGS_PER_IRQ))) { > + tx_multi_xfer->irq_cnt++; > + complete(&gi2c->done); Why? You're removing the wait_for_completion_timeout() from geni_i2c_gpi_xfer() when is_tx_multi_xfer is set. 
> + } > + tx_multi_xfer->irq_msg_cnt++; > + } else { > + complete(&gi2c->done); > + } > } > > static void geni_i2c_gpi_unmap(struct geni_i2c_dev *gi2c, struct i2c_msg *msg, > @@ -526,7 +545,42 @@ static void geni_i2c_gpi_unmap(struct geni_i2c_dev *gi2c, struct i2c_msg *msg, > } > } > > -static int geni_i2c_gpi(struct geni_i2c_dev *gi2c, struct i2c_msg *msg, > +/** > + * gpi_i2c_multi_desc_unmap() - unmaps the buffers post multi message TX transfers > + * @dev: pointer to the corresponding dev node > + * @gi2c: i2c dev handle > + * @msgs: i2c messages array > + * @peripheral: pointer to the gpi_i2c_config > + */ > +static void gpi_i2c_multi_desc_unmap(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[], > + struct gpi_i2c_config *peripheral) > +{ > + u32 msg_xfer_cnt, wr_idx = 0; > + struct gpi_multi_xfer *tx_multi_xfer = &peripheral->multi_xfer; > + > + /* > + * In error case, need to unmap all messages based on the msg_idx_cnt. > + * Non-error case unmap all the processed messages. What is the benefit of this optimization, compared to keeping things simple and just unmap all buffers at the end of geni_i2c_gpi_xfer()? 
> + */ > + if (gi2c->err) > + msg_xfer_cnt = tx_multi_xfer->msg_idx_cnt; > + else > + msg_xfer_cnt = tx_multi_xfer->irq_cnt * NUM_MSGS_PER_IRQ; > + > + /* Unmap the processed DMA buffers based on the received interrupt count */ > + for (; tx_multi_xfer->unmap_msg_cnt < msg_xfer_cnt; tx_multi_xfer->unmap_msg_cnt++) { > + if (tx_multi_xfer->unmap_msg_cnt == gi2c->num_msgs) > + break; > + wr_idx = tx_multi_xfer->unmap_msg_cnt % QCOM_GPI_MAX_NUM_MSGS; > + geni_i2c_gpi_unmap(gi2c, &msgs[tx_multi_xfer->unmap_msg_cnt], > + tx_multi_xfer->dma_buf[wr_idx], > + tx_multi_xfer->dma_addr[wr_idx], > + NULL, (dma_addr_t)NULL); > + tx_multi_xfer->freed_msg_cnt++; > + } > +} > + > +static int geni_i2c_gpi(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[], > struct dma_slave_config *config, dma_addr_t *dma_addr_p, > void **buf, unsigned int op, struct dma_chan *dma_chan) > { > @@ -538,26 +592,48 @@ static int geni_i2c_gpi(struct geni_i2c_dev *gi2c, struct i2c_msg *msg, > enum dma_transfer_direction dma_dirn; > struct dma_async_tx_descriptor *desc; > int ret; > + struct gpi_multi_xfer *gi2c_gpi_xfer; > + dma_cookie_t cookie; > + u32 msg_idx; > > peripheral = config->peripheral_config; > - > - dma_buf = i2c_get_dma_safe_msg_buf(msg, 1); > - if (!dma_buf) > - return -ENOMEM; > + gi2c_gpi_xfer = &peripheral->multi_xfer; > + dma_buf = gi2c_gpi_xfer->dma_buf[gi2c_gpi_xfer->buf_idx]; > + addr = gi2c_gpi_xfer->dma_addr[gi2c_gpi_xfer->buf_idx]; > + msg_idx = gi2c_gpi_xfer->msg_idx_cnt; > + > + dma_buf = i2c_get_dma_safe_msg_buf(&msgs[msg_idx], 1); > + if (!dma_buf) { > + ret = -ENOMEM; > + goto out; > + } > > if (op == I2C_WRITE) > map_dirn = DMA_TO_DEVICE; > else > map_dirn = DMA_FROM_DEVICE; > > - addr = dma_map_single(gi2c->se.dev->parent, dma_buf, msg->len, map_dirn); > + addr = dma_map_single(gi2c->se.dev->parent, dma_buf, > + msgs[msg_idx].len, map_dirn); > if (dma_mapping_error(gi2c->se.dev->parent, addr)) { > - i2c_put_dma_safe_msg_buf(dma_buf, msg, false); > - return -ENOMEM; > + 
i2c_put_dma_safe_msg_buf(dma_buf, &msgs[msg_idx], false); > + ret = -ENOMEM; > + goto out; > + } > + > + if (gi2c->is_tx_multi_xfer) { > + if (((msg_idx + 1) % NUM_MSGS_PER_IRQ)) > + peripheral->flags |= QCOM_GPI_BLOCK_EVENT_IRQ; > + else > + peripheral->flags &= ~QCOM_GPI_BLOCK_EVENT_IRQ; > + > + /* BEI bit to be cleared for last TRE */ > + if (msg_idx == gi2c->num_msgs - 1) > + peripheral->flags &= ~QCOM_GPI_BLOCK_EVENT_IRQ; > } > > /* set the length as message for rx txn */ > - peripheral->rx_len = msg->len; > + peripheral->rx_len = msgs[msg_idx].len; > peripheral->op = op; > > ret = dmaengine_slave_config(dma_chan, config); > @@ -575,7 +651,8 @@ static int geni_i2c_gpi(struct geni_i2c_dev *gi2c, struct i2c_msg *msg, > else > dma_dirn = DMA_DEV_TO_MEM; > > - desc = dmaengine_prep_slave_single(dma_chan, addr, msg->len, dma_dirn, flags); > + desc = dmaengine_prep_slave_single(dma_chan, addr, msgs[msg_idx].len, > + dma_dirn, flags); > if (!desc) { > dev_err(gi2c->se.dev, "prep_slave_sg failed\n"); > ret = -EIO; > @@ -585,15 +662,48 @@ static int geni_i2c_gpi(struct geni_i2c_dev *gi2c, struct i2c_msg *msg, > desc->callback_result = i2c_gpi_cb_result; > desc->callback_param = gi2c; > > - dmaengine_submit(desc); > - *buf = dma_buf; > - *dma_addr_p = addr; > + if (!((msgs[msg_idx].flags & I2C_M_RD) && op == I2C_WRITE)) { > + gi2c_gpi_xfer->msg_idx_cnt++; > + gi2c_gpi_xfer->buf_idx = (msg_idx + 1) % QCOM_GPI_MAX_NUM_MSGS; > + } > + cookie = dmaengine_submit(desc); > + if (dma_submit_error(cookie)) { > + dev_err(gi2c->se.dev, > + "%s: dmaengine_submit failed (%d)\n", __func__, cookie); > + ret = -EINVAL; > + goto err_config; > + } > > + if (gi2c->is_tx_multi_xfer) { > + dma_async_issue_pending(gi2c->tx_c); > + if ((msg_idx == (gi2c->num_msgs - 1)) || > + (gi2c_gpi_xfer->msg_idx_cnt >= > + QCOM_GPI_MAX_NUM_MSGS + gi2c_gpi_xfer->freed_msg_cnt)) { > + ret = gpi_multi_desc_process(gi2c->se.dev, gi2c_gpi_xfer, A function call straight into the GPI driver? 
I'm not entirely familiar with the details of the dmaengine API, but this doesn't look correct. > + gi2c->num_msgs, XFER_TIMEOUT, > + &gi2c->done); > + if (ret) { > + dev_err(gi2c->se.dev, > + "I2C multi write msg transfer timeout: %d\n", > + ret); > + gi2c->err = ret; > + goto err_config; > + } > + } > + } else { > + /* Non multi descriptor message transfer */ > + *buf = dma_buf; > + *dma_addr_p = addr; > + } > return 0; > > err_config: > - dma_unmap_single(gi2c->se.dev->parent, addr, msg->len, map_dirn); > - i2c_put_dma_safe_msg_buf(dma_buf, msg, false); > + dma_unmap_single(gi2c->se.dev->parent, addr, > + msgs[msg_idx].len, map_dirn); > + i2c_put_dma_safe_msg_buf(dma_buf, &msgs[msg_idx], false); > + > +out: > + gi2c->err = ret; > return ret; > } > > @@ -605,6 +715,7 @@ static int geni_i2c_gpi_xfer(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[], i > unsigned long time_left; > dma_addr_t tx_addr, rx_addr; > void *tx_buf = NULL, *rx_buf = NULL; > + struct gpi_multi_xfer *tx_multi_xfer; > const struct geni_i2c_clk_fld *itr = gi2c->clk_fld; > > config.peripheral_config = &peripheral; > @@ -618,6 +729,34 @@ static int geni_i2c_gpi_xfer(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[], i > peripheral.set_config = 1; > peripheral.multi_msg = false; > > + gi2c->gpi_config = &peripheral; > + gi2c->num_msgs = num; > + gi2c->is_tx_multi_xfer = false; > + gi2c->tx_irq_cnt = 0; > + > + tx_multi_xfer = &peripheral.multi_xfer; > + memset(tx_multi_xfer, 0, sizeof(struct gpi_multi_xfer)); > + > + /* > + * If number of write messages are four and higher then Why four? > + * configure hardware for multi descriptor transfers with BEI. > + */ > + if (num >= MIN_NUM_OF_MSGS_MULTI_DESC) { > + gi2c->is_tx_multi_xfer = true; > + for (i = 0; i < num; i++) { > + if (msgs[i].flags & I2C_M_RD) { > + /* > + * Multi descriptor transfer with BEI > + * support is enabled for write transfers. > + * Add BEI optimization support for read > + * transfers later. 
Prefix this comment with "TODO:" > + */ > + gi2c->is_tx_multi_xfer = false; > + break; > + } > + } > + } > + > for (i = 0; i < num; i++) { > gi2c->cur = &msgs[i]; > gi2c->err = 0; > @@ -628,14 +767,16 @@ static int geni_i2c_gpi_xfer(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[], i > peripheral.stretch = 1; > > peripheral.addr = msgs[i].addr; > + if (i > 0 && (!(msgs[i].flags & I2C_M_RD))) > + peripheral.multi_msg = false; > > - ret = geni_i2c_gpi(gi2c, &msgs[i], &config, > + ret = geni_i2c_gpi(gi2c, msgs, &config, > &tx_addr, &tx_buf, I2C_WRITE, gi2c->tx_c); > if (ret) > goto err; > > if (msgs[i].flags & I2C_M_RD) { > - ret = geni_i2c_gpi(gi2c, &msgs[i], &config, > + ret = geni_i2c_gpi(gi2c, msgs, &config, > &rx_addr, &rx_buf, I2C_READ, gi2c->rx_c); > if (ret) > goto err; > @@ -643,18 +784,26 @@ static int geni_i2c_gpi_xfer(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[], i > dma_async_issue_pending(gi2c->rx_c); > } > > - dma_async_issue_pending(gi2c->tx_c); > - > - time_left = wait_for_completion_timeout(&gi2c->done, XFER_TIMEOUT); > - if (!time_left) > - gi2c->err = -ETIMEDOUT; > + if (!gi2c->is_tx_multi_xfer) { > + dma_async_issue_pending(gi2c->tx_c); > + time_left = wait_for_completion_timeout(&gi2c->done, XFER_TIMEOUT); By making this conditional on !is_tx_multi_xfer transfers, what makes the loop wait for the transfer to complete before you below unmap the buffers? 
> + if (!time_left) { > + dev_err(gi2c->se.dev, "%s:I2C timeout\n", __func__); > + gi2c->err = -ETIMEDOUT; > + } > + } > > if (gi2c->err) { > ret = gi2c->err; > goto err; > } > > - geni_i2c_gpi_unmap(gi2c, &msgs[i], tx_buf, tx_addr, rx_buf, rx_addr); > + if (!gi2c->is_tx_multi_xfer) { > + geni_i2c_gpi_unmap(gi2c, &msgs[i], tx_buf, tx_addr, rx_buf, rx_addr); > + } else if (gi2c->tx_irq_cnt != tx_multi_xfer->irq_cnt) { > + gi2c->tx_irq_cnt = tx_multi_xfer->irq_cnt; > + gpi_i2c_multi_desc_unmap(gi2c, msgs, &peripheral); > + } > } > > return num; > @@ -663,7 +812,11 @@ static int geni_i2c_gpi_xfer(struct geni_i2c_dev *gi2c, struct i2c_msg msgs[], i > dev_err(gi2c->se.dev, "GPI transfer failed: %d\n", ret); > dmaengine_terminate_sync(gi2c->rx_c); > dmaengine_terminate_sync(gi2c->tx_c); > - geni_i2c_gpi_unmap(gi2c, &msgs[i], tx_buf, tx_addr, rx_buf, rx_addr); > + if (gi2c->is_tx_multi_xfer) > + gpi_i2c_multi_desc_unmap(gi2c, msgs, &peripheral); > + else > + geni_i2c_gpi_unmap(gi2c, &msgs[i], tx_buf, tx_addr, rx_buf, rx_addr); > + As above, it would be nice if multi-xfer was just a special case with a single buffer; rather than inflating the cyclomatic complexity. Regards, Bjorn > return ret; > } > > -- > 2.17.1 > >
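
The interrupt pattern set up by the `QCOM_GPI_BLOCK_EVENT_IRQ` hunk quoted above can be modelled in isolation. The following is a minimal userspace sketch, not driver code: `bei_set()` and `count_irqs()` are hypothetical helper names, and `NUM_MSGS_PER_IRQ` is assumed to be 8 because the commit message speaks of chunks of 8 messages (the quoted diff does not show its definition).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: value taken from the commit message, not from the diff. */
#define NUM_MSGS_PER_IRQ 8

/*
 * Returns true when the descriptor for message msg_idx (0-based) would
 * carry the Block Event Interrupt flag, i.e. its completion raises no
 * interrupt. Mirrors the conditions in the quoted geni_i2c_gpi() hunk.
 */
static bool bei_set(unsigned int msg_idx, unsigned int num_msgs)
{
	assert(msg_idx < num_msgs);

	/* The last message always interrupts so the transfer can finish. */
	if (msg_idx == num_msgs - 1)
		return false;

	/* Every NUM_MSGS_PER_IRQ-th message interrupts, the others do not. */
	return ((msg_idx + 1) % NUM_MSGS_PER_IRQ) != 0;
}

/* Counts the completion interrupts expected for a transfer of num_msgs. */
static unsigned int count_irqs(unsigned int num_msgs)
{
	unsigned int i, irqs = 0;

	for (i = 0; i < num_msgs; i++)
		if (!bei_set(i, num_msgs))
			irqs++;

	return irqs;
}
```

Under these assumptions, `count_irqs(200)` gives 25 for the 200-message transfer mentioned in the commit message, compared with 200 interrupts when every message completes with its own interrupt.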

11 months, 1 week

Re: [PATCH v2 RESEND 2/3] i2c: qcom_geni: Update compile dependenices for qcom geni

by Bjorn Andersson

On Mon, Nov 11, 2024 at 07:32:43PM +0530, Jyothi Kumar Seerapu wrote:
> I2C_QCOM_GENI is having compile dependencies on QCOM_GPI_DMA and
> so update I2C_QCOM_GENI to depends on QCOM_GPI_DMA.
>

Given that this is a separate patch, your wording can only be
interpreted as this being an existing problem.

> Signed-off-by: Jyothi Kumar Seerapu <quic_jseerapu(a)quicinc.com>
> ---
>
> v1 -> v2:
> This patch is added in v2 to address the kernel test robot
> reported compilation error.
> ERROR: modpost: "gpi_multi_desc_process" [drivers/i2c/busses/i2c-qcom-geni.ko] undefined!

But as far as I can tell you introduce this problem in patch 3. If so
this addition should be part of patch 3.

Also, you have different subject prefix for patch 2 and 3, yet they
relate to the same driver. Not pretty.

Regards,
Bjorn

>
> drivers/i2c/busses/Kconfig | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
> index 0aa948014008..87634a682855 100644
> --- a/drivers/i2c/busses/Kconfig
> +++ b/drivers/i2c/busses/Kconfig
> @@ -1049,6 +1049,7 @@ config I2C_QCOM_GENI
> 	tristate "Qualcomm Technologies Inc.'s GENI based I2C controller"
> 	depends on ARCH_QCOM || COMPILE_TEST
> 	depends on QCOM_GENI_SE
> +	depends on QCOM_GPI_DMA
> 	help
> 	  This driver supports GENI serial engine based I2C controller in
> 	  master mode on the Qualcomm Technologies Inc.'s SoCs. If you say
> --
> 2.17.1
>
>

11 months, 1 week

Re: [PATCH 2/3] dma-buf: sort fences in dma_fence_unwrap_merge

by Christian König

Am 08.11.24 um 12:22 schrieb Tvrtko Ursulin: > On 07/11/2024 16:00, Tvrtko Ursulin wrote: >> On 24/10/2024 13:41, Christian König wrote: >>> The merge function initially handled only individual fences and >>> arrays which in turn were created by the merge function. This allowed >>> to create the new array by a simple merge sort based on the fence >>> context number. >>> >>> The problem is now that since the addition of timeline sync objects >>> userspace can create chain containers in basically any fence context >>> order. >>> >>> If those are merged together it can happen that we create really >>> large arrays since the merge sort algorithm doesn't work any more. >>> >>> So put an insert sort behind the merge sort which kicks in when the >>> input fences are not in the expected order. This isn't as efficient >>> as a heap sort, but has better properties for the most common use >>> case. >>> >>> Signed-off-by: Christian König <christian.koenig(a)amd.com> >>> --- >>> drivers/dma-buf/dma-fence-unwrap.c | 39 >>> ++++++++++++++++++++++++++---- >>> 1 file changed, 34 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c >>> b/drivers/dma-buf/dma-fence-unwrap.c >>> index 628af51c81af..d9aa280d9ff6 100644 >>> --- a/drivers/dma-buf/dma-fence-unwrap.c >>> +++ b/drivers/dma-buf/dma-fence-unwrap.c >>> @@ -106,7 +106,7 @@ struct dma_fence >>> *__dma_fence_unwrap_merge(unsigned int num_fences, >>> fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]); >>> count = 0; >>> - do { >>> + while (true) { >>> unsigned int sel; >>> restart: >>> @@ -144,11 +144,40 @@ struct dma_fence >>> *__dma_fence_unwrap_merge(unsigned int num_fences, >>> } >>> } >>> - if (tmp) { >>> - array[count++] = dma_fence_get(tmp); >>> - fences[sel] = dma_fence_unwrap_next(&iter[sel]); >>> + if (!tmp) >>> + break; >>> + >>> + /* >>> + * We could use a binary search here, but since the assumption >>> + * is that the main input are already sorted dma_fence_arrays >>> + * just 
looking from end has a higher chance of finding the >>> + * right location on the first try >>> + */ >>> + >>> + for (i = count; i--;) { >>> + if (likely(array[i]->context < tmp->context)) >>> + break; >>> + >>> + if (array[i]->context == tmp->context) { >>> + if (dma_fence_is_later(tmp, array[i])) { >>> + dma_fence_put(array[i]); >>> + array[i] = dma_fence_get(tmp); >>> + } >>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]); >>> + goto restart; >>> + } >>> } >>> - } while (tmp); >>> + >>> + ++i; >>> + /* >>> + * Make room for the fence, this should be a nop most of the >>> + * time. >>> + */ >>> + memcpy(&array[i + 1], &array[i], (count - i) * >>> sizeof(*array)); >>> + array[i] = dma_fence_get(tmp); >>> + fences[sel] = dma_fence_unwrap_next(&iter[sel]); >>> + count++; >> >> Having ventured into this function for the first time, I can say that >> this is some smart code which is not easy to grasp. It could >> definitely benefit from a high level comment before the do-while loop >> to explain what it is going to do. >> >> Next and tmp local variable names I also wonder if could be renamed >> to something more descriptive. >> >> And the algorithmic complexity of the end result, given the multiple >> loops and gotos, I have no idea what it could be. >> >> Has a dumb solution been considered like a two-pass with a >> pessimistically allocated fence array been considered? Like: >> >> 1) Populate array with all unsignalled unwrapped fences. (O(count)) >> >> 2) Bog standard include/linux/sort.h by context and seqno. >> (O(count*log (count))) >> >> 3) Walk array and squash same context to latest fence. (Before this >> patch that wasn't there, right?). (O(count)) (Overwrite in place, no >> memcpy needed.) >> >> Algorithmic complexity of that would be obvious and code much simpler. > > FWIW something like the below passes selftests. How does it look to > you? Do you think more or less efficient and more or less readable? Yeah I was considering the exact same thing. 
What hold me back was the fact that the heap sort() implementation is really inefficient for the most common use case of this. In other words two arrays with fences already sorted is basically just O(count). And I'm also not sure how many fences we see in those arrays in practice. With Vulkan basically trying to feed multiple contexts to keep all CPUs busy we might have quite a number here. Regards, Christian. > > commit 8a7c3ea7e7af85e813bf5fc151537ae37be1d6d9 > Author: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> > Date: Fri Nov 8 10:14:15 2024 +0000 > > __dma_fence_unwrap_merge > > Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin(a)igalia.com> > > diff --git a/drivers/dma-buf/dma-fence-unwrap.c > b/drivers/dma-buf/dma-fence-unwrap.c > index 628af51c81af..47d67e482e96 100644 > --- a/drivers/dma-buf/dma-fence-unwrap.c > +++ b/drivers/dma-buf/dma-fence-unwrap.c > @@ -12,6 +12,7 @@ > #include <linux/dma-fence-chain.h> > #include <linux/dma-fence-unwrap.h> > #include <linux/slab.h> > +#include <linux/sort.h> > > /* Internal helper to start new array iteration, don't use directly */ > static struct dma_fence * > @@ -59,17 +60,39 @@ struct dma_fence *dma_fence_unwrap_next(struct > dma_fence_unwrap *cursor) > } > EXPORT_SYMBOL_GPL(dma_fence_unwrap_next); > > + > +static int fence_cmp(const void *_a, const void *_b) > +{ > + const struct dma_fence *a = *(const struct dma_fence **)_a; > + const struct dma_fence *b = *(const struct dma_fence **)_b; > + > + if (a->context < b->context) > + return -1; > + else if (a->context > b->context) > + return 1; > + > + if (a->seqno < b->seqno) > + return -1; > + else if (a->seqno > b->seqno) > + return 1; > + > + return 0; > +} > + > /* Implementation for the dma_fence_merge() marco, don't use directly */ > struct dma_fence *__dma_fence_unwrap_merge(unsigned int num_fences, > struct dma_fence **fences, > struct dma_fence_unwrap *iter) > { > - struct dma_fence_array *result; > struct dma_fence *tmp, **array; > + struct dma_fence_array 
*result; > ktime_t timestamp; > - unsigned int i; > - size_t count; > + int i, j, count; > > + /* > + * Count number of unwrapped fences and fince the latest signaled > + * timestamp. > + */ > count = 0; > timestamp = ns_to_ktime(0); > for (i = 0; i < num_fences; ++i) { > @@ -92,63 +115,41 @@ struct dma_fence > *__dma_fence_unwrap_merge(unsigned int num_fences, > if (count == 0) > return dma_fence_allocate_private_stub(timestamp); > > + /* > + * Allocate and populate the array. > + */ > array = kmalloc_array(count, sizeof(*array), GFP_KERNEL); > if (!array) > return NULL; > > - /* > - * This trashes the input fence array and uses it as position for > the > - * following merge loop. This works because the dma_fence_merge() > - * wrapper macro is creating this temporary array on the stack > together > - * with the iterators. > - */ > - for (i = 0; i < num_fences; ++i) > - fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]); > - > count = 0; > - do { > - unsigned int sel; > - > -restart: > - tmp = NULL; > - for (i = 0; i < num_fences; ++i) { > - struct dma_fence *next; > - > - while (fences[i] && dma_fence_is_signaled(fences[i])) > - fences[i] = dma_fence_unwrap_next(&iter[i]); > - > - next = fences[i]; > - if (!next) > - continue; > - > - /* > - * We can't guarantee that inpute fences are ordered by > - * context, but it is still quite likely when this > - * function is used multiple times. So attempt to order > - * the fences by context as we pass over them and merge > - * fences with the same context. 
> - */ > - if (!tmp || tmp->context > next->context) { > - tmp = next; > - sel = i; > - > - } else if (tmp->context < next->context) { > - continue; > - > - } else if (dma_fence_is_later(tmp, next)) { > - fences[i] = dma_fence_unwrap_next(&iter[i]); > - goto restart; > - } else { > - fences[sel] = dma_fence_unwrap_next(&iter[sel]); > - goto restart; > - } > + for (i = 0; i < num_fences; ++i) { > + dma_fence_unwrap_for_each(tmp, &iter[i], fences[i]) { > + if (!dma_fence_is_signaled(tmp)) > + array[count++] = tmp; > } > + } > + > + /* > + * Sort in context and seqno order. > + */ > + sort(array, count, sizeof(*array), fence_cmp, NULL); > > - if (tmp) { > - array[count++] = dma_fence_get(tmp); > - fences[sel] = dma_fence_unwrap_next(&iter[sel]); > + /* > + * Only keep the most recent fence for each context. > + */ > + j = 0; > + tmp = array[0]; > + for (i = 1; i < count; i++) { > + if (array[i]->context != tmp->context) { > + array[j++] = dma_fence_get(tmp); > } > - } while (tmp); > + tmp = array[i]; > + } > + if (tmp->context != array[j - 1]->context) { > + array[j++] = dma_fence_get(tmp); > + } > + count = j; > > if (count == 0) { > tmp = dma_fence_allocate_private_stub(ktime_get()); > > > Regards, > > Tvrtko > > >> >> Regards, >> >> Tvrtko >> >>> + }; >>> if (count == 0) { >>> tmp = dma_fence_allocate_private_stub(ktime_get());
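
The three-step alternative quoted above (collect, sort by context and seqno, squash each context to its latest fence) can be tried out in userspace. What follows is a minimal sketch under stated assumptions, not kernel code: `struct fake_fence` and `sort_and_squash()` are hypothetical names, libc `qsort()` stands in for the kernel's `sort()`, reference counting is left out, and seqnos are compared as plain integers where the kernel would use `dma_fence_is_later()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the two struct dma_fence fields used here. */
struct fake_fence {
	unsigned long long context;
	unsigned long long seqno;
};

/* Order by context first, then by seqno, both ascending. */
static int fence_cmp(const void *a_, const void *b_)
{
	const struct fake_fence *a = a_;
	const struct fake_fence *b = b_;

	if (a->context != b->context)
		return a->context < b->context ? -1 : 1;
	if (a->seqno != b->seqno)
		return a->seqno < b->seqno ? -1 : 1;
	return 0;
}

/*
 * Sorts f[0..count) and keeps only the highest seqno per context,
 * overwriting in place. Returns the number of entries kept.
 */
static size_t sort_and_squash(struct fake_fence *f, size_t count)
{
	size_t i, j = 0;

	if (count == 0)
		return 0;

	qsort(f, count, sizeof(*f), fence_cmp);

	for (i = 1; i < count; i++) {
		/* A new context opens a new output slot. */
		if (f[i].context != f[j].context)
			j++;
		/* Within one context the later (larger) seqno wins. */
		f[j] = f[i];
	}

	return j + 1;
}
```

The squash pass is linear, so the sort dominates the cost. That is the trade-off raised in the reply: inputs that are already ordered by context still pay for the full sort here, whereas the merge loop handles them in roughly one pass.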

11 months, 2 weeks

Re: [PATCH 1/3] dma-buf/dma-fence_array: use kvzalloc

by Christian König

On 07.11.24 at 12:29, Tvrtko Ursulin wrote:
>
> On 28/10/2024 10:34, Christian König wrote:
>> On 25.10.24 at 11:05, Tvrtko Ursulin wrote:
>>>
>>> On 25/10/2024 09:59, Tvrtko Ursulin wrote:
>>>>
>>>> On 24/10/2024 13:41, Christian König wrote:
>>>>> Reports indicates that some userspace applications try to merge
>>>>> more than
>>>>> 80k of fences into a single dma_fence_array leading to a warning from
>>>>> kzalloc() that the requested size becomes to big.
>>>>>
>>>>> While that is clearly an userspace bug we should probably handle
>>>>> that case
>>>>> gracefully in the kernel.
>>>>>
>>>>> So we can either reject requests to merge more than a reasonable
>>>>> amount of
>>>>> fences (64k maybe?) or we can start to use kvzalloc() instead of
>>>>> kzalloc().
>>>>> This patch here does the later.
>>>>
>>>> Rejecting would potentially be safer, otherwise there is a path for
>>>> userspace to trigger a warn in kvmalloc_node (see 0829b5bcdd3b
>>>> ("drm/i915: 2 GiB of relocations ought to be enough for anybody*"))
>>>> and spam dmesg at will.
>>>
>>> Actually that is a WARN_ON_*ONCE* there so maybe not so critical to
>>> invent a limit. Up for discussion I suppose.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>> Question is what limit to set...
>>
>> That's one of the reasons why I opted for kvzalloc() initially.
>
> I didn't get that, what was the reason? To not have to invent an
> arbitrary limit?

Well, I couldn't come up with any arbitrary limit that I was confident
would work and not block real-world use cases. Switching to kvzalloc()
just seemed the more defensive approach.

>
>> I mean we could use some nice round number like 65536, but that would
>> be totally arbitrary.
>
> Yeah.. Set an arbitrary limit so a warning in __kvmalloc_node_noprof()
> is avoided? Or pass __GFP_NOWARN?

Well, are we sure that we will never hit 65536 in a real-world use case?
It's still pretty low.

>
>> Any comments on the other two patches? I need to get them upstream.
>
> Will look into them shortly.

Thanks,
Christian.

>
> Regards,
>
> Tvrtko
>
>
>> Thanks,
>> Christian.
>>
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>>>>> CC: stable(a)vger.kernel.org
>>>>> ---
>>>>> drivers/dma-buf/dma-fence-array.c | 6 +++---
>>>>> 1 file changed, 3 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/dma-buf/dma-fence-array.c
>>>>> b/drivers/dma-buf/dma-fence-array.c
>>>>> index 8a08ffde31e7..46ac42bcfac0 100644
>>>>> --- a/drivers/dma-buf/dma-fence-array.c
>>>>> +++ b/drivers/dma-buf/dma-fence-array.c
>>>>> @@ -119,8 +119,8 @@ static void dma_fence_array_release(struct
>>>>> dma_fence *fence)
>>>>> 	for (i = 0; i < array->num_fences; ++i)
>>>>> 		dma_fence_put(array->fences[i]);
>>>>> -	kfree(array->fences);
>>>>> -	dma_fence_free(fence);
>>>>> +	kvfree(array->fences);
>>>>> +	kvfree_rcu(fence, rcu);
>>>>> }
>>>>> static void dma_fence_array_set_deadline(struct dma_fence *fence,
>>>>> @@ -153,7 +153,7 @@ struct dma_fence_array
>>>>> *dma_fence_array_alloc(int num_fences)
>>>>> {
>>>>> 	struct dma_fence_array *array;
>>>>> -	return kzalloc(struct_size(array, callbacks, num_fences),
>>>>> GFP_KERNEL);
>>>>> +	return kvzalloc(struct_size(array, callbacks, num_fences),
>>>>> GFP_KERNEL);
>>>>> }
>>>>> EXPORT_SYMBOL(dma_fence_array_alloc);
>>

11 months, 2 weeks

Re: [PATCH 2/3] dma-buf: sort fences in dma_fence_unwrap_merge

by Christian König

On 30.10.24 at 19:10, Friedrich Vock wrote:
> On 24.10.24 14:41, Christian König wrote:
>> The merge function initially handled only individual fences and
>> arrays which in turn were created by the merge function. This allowed
>> to create the new array by a simple merge sort based on the fence
>> context number.
>>
>> The problem is now that since the addition of timeline sync objects
>> userspace can create chain containers in basically any fence context
>> order.
>>
>> If those are merged together it can happen that we create really
>> large arrays since the merge sort algorithm doesn't work any more.
>>
>> So put an insert sort behind the merge sort which kicks in when the
>> input fences are not in the expected order. This isn't as efficient
>> as a heap sort, but has better properties for the most common use
>> case.
>>
>> Signed-off-by: Christian König <christian.koenig(a)amd.com>
>> ---
>> drivers/dma-buf/dma-fence-unwrap.c | 39 ++++++++++++++++++++++++++----
>> 1 file changed, 34 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence-unwrap.c
>> b/drivers/dma-buf/dma-fence-unwrap.c
>> index 628af51c81af..d9aa280d9ff6 100644
>> --- a/drivers/dma-buf/dma-fence-unwrap.c
>> +++ b/drivers/dma-buf/dma-fence-unwrap.c
>> @@ -106,7 +106,7 @@ struct dma_fence
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>> 		fences[i] = dma_fence_unwrap_first(fences[i], &iter[i]);
>>
>> 	count = 0;
>> -	do {
>> +	while (true) {
>> 		unsigned int sel;
>>
>> restart:
>> @@ -144,11 +144,40 @@ struct dma_fence
>> *__dma_fence_unwrap_merge(unsigned int num_fences,
>> 			}
>> 		}
>>
>> -		if (tmp) {
>> -			array[count++] = dma_fence_get(tmp);
>> -			fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +		if (!tmp)
>> +			break;
>> +
>> +		/*
>> +		 * We could use a binary search here, but since the assumption
>> +		 * is that the main input are already sorted dma_fence_arrays
>> +		 * just looking from end has a higher chance of finding the
>> +		 * right location on the first try
>> +		 */
>> +
>> +		for (i = count; i--;) {
>
> This is broken. The first iteration of this loop will always index out
> of bounds.

Nope, that is correct. The condition is evaluated before the loop body
runs, so the i-- reduces the index to the last element in the array.

Regards,
Christian.

> What you probably want here is:
>
> +		for (i = count - 1; count && i--;) {
>
> This intentionally overflows for count == 0, but the ++i after the loop
> undoes that. Maybe it would be worth a comment to point out that's
> intentional.
>
>> +			if (likely(array[i]->context < tmp->context))
>> +				break;
>> +
>> +			if (array[i]->context == tmp->context) {
>> +				if (dma_fence_is_later(tmp, array[i])) {
>> +					dma_fence_put(array[i]);
>> +					array[i] = dma_fence_get(tmp);
>> +				}
>> +				fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +				goto restart;
>> +			}
>> 		}
>> -	} while (tmp);
>> +
>> +		++i;
>> +		/*
>> +		 * Make room for the fence, this should be a nop most of the
>> +		 * time.
>> +		 */
>> +		memcpy(&array[i + 1], &array[i], (count - i) * sizeof(*array));
>
> Need memmove here, src and dst alias.
>
> I took it for a spin with these things fixed and it seemed to resolve
> the issue as well. How do you want to proceed? I guess I would be
> comfortable putting a Reviewed-by and/or Tested-by on a version with
> these things fixed (with the usual caveat that I'm not a maintainer - I
> guess the process requires (at least one) reviewer to be?).
>
> By the way, I guess you might've had some internal branches where this
> fix needed to go into quick or something? Usually I'm happy to make a v2
> for my patches myself, too ;)
>
> Regards,
> Friedrich
>
>> +		array[i] = dma_fence_get(tmp);
>> +		fences[sel] = dma_fence_unwrap_next(&iter[sel]);
>> +		count++;
>> +	};
>>
>> 	if (count == 0) {
>> 		tmp = dma_fence_allocate_private_stub(ktime_get());
>
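
The two points debated in this message, the `for (i = count; i--;)` loop bounds and the overlapping copy, can be checked with a small userspace model. This is a sketch under stated assumptions rather than the kernel code: `insert_sorted()` is a hypothetical helper, plain integers stand in for fence context numbers, a duplicate key is simply skipped where the patch would keep the later fence, and `memmove()` is used as suggested in the review because source and destination overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Inserts key into the ascending array arr[0..count), scanning from the
 * end. The caller must guarantee room for one more element. Returns the
 * new element count; an already present key is not inserted again.
 */
static size_t insert_sorted(unsigned long long *arr, size_t count,
			    unsigned long long key)
{
	size_t i;

	/*
	 * The condition "i--" is tested before each pass and decrements
	 * afterwards, so the body sees i = count - 1 down to 0 and never
	 * runs at all when count == 0.
	 */
	for (i = count; i--;) {
		if (arr[i] < key)
			break;
		if (arr[i] == key)
			return count;
	}

	/*
	 * i is now the index of the last smaller element, or has wrapped
	 * around (well defined for unsigned types) if there is none.
	 * Either way ++i yields the insertion slot.
	 */
	++i;

	/* Shift the tail up by one; the regions overlap, hence memmove(). */
	memmove(&arr[i + 1], &arr[i], (count - i) * sizeof(*arr));
	arr[i] = key;

	return count + 1;
}
```

When the input arrives in ascending order, the loop breaks on its first pass and the shift moves zero bytes, which is the common case the commit message says the insertion step is tuned for.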

11 months, 2 weeks
