From: Rob Clark robdclark@gmail.com
For devices which have constraints about maximum number of segments in an sglist. For example, a device which could only deal with contiguous buffers would set max_segment_count to 1.
The initial motivation is for devices sharing buffers via dma-buf, to allow the buffer exporter to know the constraints of other devices which have attached to the buffer. The dma_mask and fields in 'struct device_dma_parameters' tell the exporter everything else that is needed, except whether the importer has constraints about maximum number of segments.
Signed-off-by: Rob Clark robdclark@gmail.com [sumits: Minor updates wrt comments] Signed-off-by: Sumit Semwal sumit.semwal@linaro.org ---
v3: include Robin Murphy's fix[1] for handling '0' as a value for max_segment_count v2: minor updates wrt comments on the first version
[1]: http://article.gmane.org/gmane.linux.kernel.iommu/8175/
include/linux/device.h | 1 + include/linux/dma-mapping.h | 19 +++++++++++++++++++ 2 files changed, 20 insertions(+)
diff --git a/include/linux/device.h b/include/linux/device.h index fb506738f7b7..a32f9b67315c 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -647,6 +647,7 @@ struct device_dma_parameters { * sg limitations. */ unsigned int max_segment_size; + unsigned int max_segment_count; /* INT_MAX for unlimited */ unsigned long segment_boundary_mask; };
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index c3007cb4bfa6..d3351a36d5ec 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -154,6 +154,25 @@ static inline unsigned int dma_set_max_seg_size(struct device *dev, return -EIO; }
+#define DMA_SEGMENTS_MAX_SEG_COUNT ((unsigned int) INT_MAX) + +static inline unsigned int dma_get_max_seg_count(struct device *dev) +{ + if (dev->dma_parms && dev->dma_parms->max_segment_count) + return dev->dma_parms->max_segment_count; + return DMA_SEGMENTS_MAX_SEG_COUNT; +} + +static inline int dma_set_max_seg_count(struct device *dev, + unsigned int count) +{ + if (dev->dma_parms) { + dev->dma_parms->max_segment_count = count; + return 0; + } + return -EIO; +} + static inline unsigned long dma_get_seg_boundary(struct device *dev) { return dev->dma_parms ?
Add some helpers to share the constraints of devices while attaching to the dmabuf buffer.
At each attach, the constraints are calculated based on the following: - max_segment_size, max_segment_count, segment_boundary_mask from device_dma_parameters.
In case the attaching device's constraints don't match up, attach() fails.
At detach, the constraints are recalculated based on the remaining attached devices.
Two helpers are added: - dma_buf_get_constraints - which gives the current constraints as calculated during each attach on the buffer till the time, - dma_buf_recalc_constraints - which recalculates the constraints for all currently attached devices for the 'paranoid' ones amongst us.
The idea of this patch is largely taken from Rob Clark's RFC at https://lkml.org/lkml/2012/7/19/285, and the comments received on it.
Cc: Rob Clark robdclark@gmail.com Signed-off-by: Sumit Semwal sumit.semwal@linaro.org --- v3: - Thanks to Russell's comment, remove dma_mask and coherent_dma_mask from constraints' calculation; has a nice side effect of letting us use device_dma_parameters directly to list constraints. - update the debugfs output to show constraint info as well.
v2: split constraints-sharing and allocation helpers
drivers/dma-buf/dma-buf.c | 126 +++++++++++++++++++++++++++++++++++++++++++++- include/linux/dma-buf.h | 7 +++ 2 files changed, 132 insertions(+), 1 deletion(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 5be225c2ba98..f363f1440803 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -264,6 +264,66 @@ static inline int is_dma_buf_file(struct file *file) return file->f_op == &dma_buf_fops; }
+static inline void init_constraints(struct device_dma_parameters *cons) +{ + cons->max_segment_count = (unsigned int)-1; + cons->max_segment_size = (unsigned int)-1; + cons->segment_boundary_mask = (unsigned long)-1; +} + +/* + * calc_constraints - calculates if the new attaching device's constraints + * match, with the constraints of already attached devices; if yes, returns + * the constraints; else return ERR_PTR(-EINVAL) + */ +static int calc_constraints(struct device *dev, + struct device_dma_parameters *calc_cons) +{ + struct device_dma_parameters cons = *calc_cons; + + cons.max_segment_count = min(cons.max_segment_count, + dma_get_max_seg_count(dev)); + cons.max_segment_size = min(cons.max_segment_size, + dma_get_max_seg_size(dev)); + cons.segment_boundary_mask &= dma_get_seg_boundary(dev); + + if (!cons.max_segment_count || + !cons.max_segment_size || + !cons.segment_boundary_mask) { + pr_err("Dev: %s's constraints don't match\n", dev_name(dev)); + return -EINVAL; + } + + *calc_cons = cons; + + return 0; +} + +/* + * recalc_constraints - recalculates constraints for all attached devices; + * useful for detach() recalculation, and for dma_buf_recalc_constraints() + * helper. + * Returns recalculated constraints in recalc_cons, or error in the unlikely + * case when constraints of attached devices might have changed. + */ +static int recalc_constraints(struct dma_buf *dmabuf, + struct device_dma_parameters *recalc_cons) +{ + struct device_dma_parameters calc_cons; + struct dma_buf_attachment *attach; + int ret = 0; + + init_constraints(&calc_cons); + + list_for_each_entry(attach, &dmabuf->attachments, node) { + ret = calc_constraints(attach->dev, &calc_cons); + if (ret) + return ret; + } + *recalc_cons = calc_cons; + return 0; +} + /** * dma_buf_export_named - Creates a new dma_buf, and associates an anon file * with this buffer, so it can be exported. @@ -313,6 +373,9 @@ struct dma_buf *dma_buf_export_named(void *priv, const struct dma_buf_ops *ops, dmabuf->ops = ops; dmabuf->size = size; dmabuf->exp_name = exp_name; + + init_constraints(&dmabuf->constraints); + init_waitqueue_head(&dmabuf->poll); dmabuf->cb_excl.poll = dmabuf->cb_shared.poll = &dmabuf->poll; dmabuf->cb_excl.active = dmabuf->cb_shared.active = 0; @@ -422,7 +485,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev) { struct dma_buf_attachment *attach; - int ret; + int ret = 0;
if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL); @@ -436,6 +499,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
mutex_lock(&dmabuf->lock);
+ if (calc_constraints(dev, &dmabuf->constraints)) + goto err_constraints; + if (dmabuf->ops->attach) { ret = dmabuf->ops->attach(dmabuf, dev, attach); if (ret) @@ -448,6 +514,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
err_attach: kfree(attach); +err_constraints: mutex_unlock(&dmabuf->lock); return ERR_PTR(ret); } @@ -470,6 +537,8 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach) if (dmabuf->ops->detach) dmabuf->ops->detach(dmabuf, attach);
+ recalc_constraints(dmabuf, &dmabuf->constraints); + mutex_unlock(&dmabuf->lock); kfree(attach); } @@ -770,6 +839,56 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, void *vaddr) } EXPORT_SYMBOL_GPL(dma_buf_vunmap);
+/** + * dma_buf_get_constraints - get the *current* constraints of the dmabuf, + * as calculated during each attach(); returns error on invalid inputs + * + * @dmabuf: [in] buffer to get constraints of + * @constraints: [out] current constraints are returned in this + */ +int dma_buf_get_constraints(struct dma_buf *dmabuf, + struct device_dma_parameters *constraints) +{ + if (WARN_ON(!dmabuf || !constraints)) + return -EINVAL; + + mutex_lock(&dmabuf->lock); + *constraints = dmabuf->constraints; + mutex_unlock(&dmabuf->lock); + return 0; +} +EXPORT_SYMBOL_GPL(dma_buf_get_constraints); + +/** + * dma_buf_recalc_constraints - *recalculate* the constraints for the buffer + * afresh, from the list of currently attached devices; this could be useful + * cross-check the current constraints, for exporters that might want to be + * 'paranoid' about the device constraints. + * + * returns error on invalid inputs + * + * @dmabuf: [in] buffer to get constraints of + * @constraints: [out] recalculated constraints are returned in this + */ +int dma_buf_recalc_constraints(struct dma_buf *dmabuf, + struct device_dma_parameters *constraints) +{ + struct device_dma_parameters calc_cons; + int ret = 0; + + if (WARN_ON(!dmabuf || !constraints)) + return -EINVAL; + + mutex_lock(&dmabuf->lock); + ret = recalc_constraints(dmabuf, &calc_cons); + if (!ret) + *constraints = calc_cons; + + mutex_unlock(&dmabuf->lock); + return ret; +} +EXPORT_SYMBOL_GPL(dma_buf_recalc_constraints); + #ifdef CONFIG_DEBUG_FS static int dma_buf_describe(struct seq_file *s) { @@ -801,6 +920,11 @@ static int dma_buf_describe(struct seq_file *s) buf_obj->file->f_flags, buf_obj->file->f_mode, file_count(buf_obj->file), buf_obj->exp_name); + seq_printf(s, "\tConstraints: Seg Count: %08u, Seg Size: %08u", + buf_obj->constraints.max_segment_count, + buf_obj->constraints.max_segment_size); + seq_printf(s, " seg boundary mask: %08lx\n", + buf_obj->constraints.segment_boundary_mask);
seq_puts(s, "\tAttached Devices:\n"); attach_count = 0; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 694e1fe1c4b4..489ad9b2e5ae 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -34,6 +34,7 @@ #include <linux/wait.h>
struct device; +struct device_dma_parameters; struct dma_buf; struct dma_buf_attachment;
@@ -116,6 +117,7 @@ struct dma_buf_ops { * @ops: dma_buf_ops associated with this buffer object. * @exp_name: name of the exporter; useful for debugging. * @list_node: node for dma_buf accounting and debugging. + * @constraints: calculated constraints of attached devices. * @priv: exporter specific private data for this buffer object. * @resv: reservation object linked to this dma-buf */ @@ -130,6 +132,7 @@ struct dma_buf { void *vmap_ptr; const char *exp_name; struct list_head list_node; + struct device_dma_parameters constraints; void *priv; struct reservation_object *resv;
@@ -211,4 +214,8 @@ void *dma_buf_vmap(struct dma_buf *); void dma_buf_vunmap(struct dma_buf *, void *vaddr); int dma_buf_debugfs_create_file(const char *name, int (*write)(struct seq_file *)); + +int dma_buf_get_constraints(struct dma_buf *, struct device_dma_parameters *); +int dma_buf_recalc_constraints(struct dma_buf *, + struct device_dma_parameters *); #endif /* __DMA_BUF_H__ */
Op 27-01-15 om 09:25 schreef Sumit Semwal:
Add some helpers to share the constraints of devices while attaching to the dmabuf buffer.
At each attach, the constraints are calculated based on the following:
- max_segment_size, max_segment_count, segment_boundary_mask from device_dma_parameters.
In case the attaching device's constraints don't match up, attach() fails.
At detach, the constraints are recalculated based on the remaining attached devices.
Two helpers are added:
- dma_buf_get_constraints - which gives the current constraints as calculated during each attach on the buffer till the time,
- dma_buf_recalc_constraints - which recalculates the constraints for all currently attached devices for the 'paranoid' ones amongst us.
The idea of this patch is largely taken from Rob Clark's RFC at https://lkml.org/lkml/2012/7/19/285, and the comments received on it.
Cc: Rob Clark robdclark@gmail.com Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
v3:
- Thanks to Russell's comment, remove dma_mask and coherent_dma_mask from constraints' calculation; has a nice side effect of letting us use device_dma_parameters directly to list constraints.
- update the debugfs output to show constraint info as well.
v2: split constraints-sharing and allocation helpers
drivers/dma-buf/dma-buf.c | 126 +++++++++++++++++++++++++++++++++++++++++++++- include/linux/dma-buf.h | 7 +++ 2 files changed, 132 insertions(+), 1 deletion(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 5be225c2ba98..f363f1440803 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -264,6 +264,66 @@ static inline int is_dma_buf_file(struct file *file) return file->f_op == &dma_buf_fops; } +static inline void init_constraints(struct device_dma_parameters *cons) +{
- cons->max_segment_count = (unsigned int)-1;
- cons->max_segment_size = (unsigned int)-1;
- cons->segment_boundary_mask = (unsigned long)-1;
+}
Use DMA_SEGMENTS_MAX_SEG_COUNT or UINT/ULONG_MAX here instead?
Patches look sane, Reviewed-By: Maarten Lankhorst maarten.lankhorst@canonical.com
+/*
- calc_constraints - calculates if the new attaching device's constraints
- match, with the constraints of already attached devices; if yes, returns
- the constraints; else return ERR_PTR(-EINVAL)
- */
+static int calc_constraints(struct device *dev,
struct device_dma_parameters *calc_cons)
+{
- struct device_dma_parameters cons = *calc_cons;
- cons.max_segment_count = min(cons.max_segment_count,
dma_get_max_seg_count(dev));
- cons.max_segment_size = min(cons.max_segment_size,
dma_get_max_seg_size(dev));
- cons.segment_boundary_mask &= dma_get_seg_boundary(dev);
- if (!cons.max_segment_count ||
!cons.max_segment_size ||
!cons.segment_boundary_mask) {
pr_err("Dev: %s's constraints don't match\n", dev_name(dev));
return -EINVAL;
- }
- *calc_cons = cons;
- return 0;
+}
+/*
- recalc_constraints - recalculates constraints for all attached devices;
- useful for detach() recalculation, and for dma_buf_recalc_constraints()
- helper.
- Returns recalculated constraints in recalc_cons, or error in the unlikely
- case when constraints of attached devices might have changed.
- */
+static int recalc_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *recalc_cons)
+{
- struct device_dma_parameters calc_cons;
- struct dma_buf_attachment *attach;
- int ret = 0;
- init_constraints(&calc_cons);
- list_for_each_entry(attach, &dmabuf->attachments, node) {
ret = calc_constraints(attach->dev, &calc_cons);
if (ret)
return ret;
- }
- *recalc_cons = calc_cons;
- return 0;
+}
/**
- dma_buf_export_named - Creates a new dma_buf, and associates an anon file
- with this buffer, so it can be exported.
@@ -313,6 +373,9 @@ struct dma_buf *dma_buf_export_named(void *priv, const struct dma_buf_ops *ops, dmabuf->ops = ops; dmabuf->size = size; dmabuf->exp_name = exp_name;
- init_constraints(&dmabuf->constraints);
- init_waitqueue_head(&dmabuf->poll); dmabuf->cb_excl.poll = dmabuf->cb_shared.poll = &dmabuf->poll; dmabuf->cb_excl.active = dmabuf->cb_shared.active = 0;
@@ -422,7 +485,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev) { struct dma_buf_attachment *attach;
- int ret;
- int ret = 0;
if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL); @@ -436,6 +499,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, mutex_lock(&dmabuf->lock);
- if (calc_constraints(dev, &dmabuf->constraints))
goto err_constraints;
- if (dmabuf->ops->attach) { ret = dmabuf->ops->attach(dmabuf, dev, attach); if (ret)
@@ -448,6 +514,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, err_attach: kfree(attach); +err_constraints: mutex_unlock(&dmabuf->lock); return ERR_PTR(ret); } @@ -470,6 +537,8 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach) if (dmabuf->ops->detach) dmabuf->ops->detach(dmabuf, attach);
- recalc_constraints(dmabuf, &dmabuf->constraints);
- mutex_unlock(&dmabuf->lock); kfree(attach);
} @@ -770,6 +839,56 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, void *vaddr) } EXPORT_SYMBOL_GPL(dma_buf_vunmap); +/**
- dma_buf_get_constraints - get the *current* constraints of the dmabuf,
- as calculated during each attach(); returns error on invalid inputs
- @dmabuf: [in] buffer to get constraints of
- @constraints: [out] current constraints are returned in this
- */
+int dma_buf_get_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *constraints)
+{
- if (WARN_ON(!dmabuf || !constraints))
return -EINVAL;
- mutex_lock(&dmabuf->lock);
- *constraints = dmabuf->constraints;
- mutex_unlock(&dmabuf->lock);
- return 0;
+} +EXPORT_SYMBOL_GPL(dma_buf_get_constraints);
+/**
- dma_buf_recalc_constraints - *recalculate* the constraints for the buffer
- afresh, from the list of currently attached devices; this could be useful
- cross-check the current constraints, for exporters that might want to be
- 'paranoid' about the device constraints.
- returns error on invalid inputs
- @dmabuf: [in] buffer to get constraints of
- @constraints: [out] recalculated constraints are returned in this
- */
+int dma_buf_recalc_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *constraints)
+{
- struct device_dma_parameters calc_cons;
- int ret = 0;
- if (WARN_ON(!dmabuf || !constraints))
return -EINVAL;
- mutex_lock(&dmabuf->lock);
- ret = recalc_constraints(dmabuf, &calc_cons);
- if (!ret)
*constraints = calc_cons;
- mutex_unlock(&dmabuf->lock);
- return ret;
+} +EXPORT_SYMBOL_GPL(dma_buf_recalc_constraints);
#ifdef CONFIG_DEBUG_FS static int dma_buf_describe(struct seq_file *s) { @@ -801,6 +920,11 @@ static int dma_buf_describe(struct seq_file *s) buf_obj->file->f_flags, buf_obj->file->f_mode, file_count(buf_obj->file), buf_obj->exp_name);
seq_printf(s, "\tConstraints: Seg Count: %08u, Seg Size: %08u",
buf_obj->constraints.max_segment_count,
buf_obj->constraints.max_segment_size);
seq_printf(s, " seg boundary mask: %08lx\n",
buf_obj->constraints.segment_boundary_mask);
seq_puts(s, "\tAttached Devices:\n"); attach_count = 0; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 694e1fe1c4b4..489ad9b2e5ae 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -34,6 +34,7 @@ #include <linux/wait.h> struct device; +struct device_dma_parameters; struct dma_buf; struct dma_buf_attachment; @@ -116,6 +117,7 @@ struct dma_buf_ops {
- @ops: dma_buf_ops associated with this buffer object.
- @exp_name: name of the exporter; useful for debugging.
- @list_node: node for dma_buf accounting and debugging.
*/
- @constraints: calculated constraints of attached devices.
- @priv: exporter specific private data for this buffer object.
- @resv: reservation object linked to this dma-buf
@@ -130,6 +132,7 @@ struct dma_buf { void *vmap_ptr; const char *exp_name; struct list_head list_node;
- struct device_dma_parameters constraints; void *priv; struct reservation_object *resv;
@@ -211,4 +214,8 @@ void *dma_buf_vmap(struct dma_buf *); void dma_buf_vunmap(struct dma_buf *, void *vaddr); int dma_buf_debugfs_create_file(const char *name, int (*write)(struct seq_file *));
+int dma_buf_get_constraints(struct dma_buf *, struct device_dma_parameters *); +int dma_buf_recalc_constraints(struct dma_buf *,
struct device_dma_parameters *);
#endif /* __DMA_BUF_H__ */
On Tue, Jan 27, 2015 at 01:55:54PM +0530, Sumit Semwal wrote:
+/*
- recalc_constraints - recalculates constraints for all attached devices;
- useful for detach() recalculation, and for dma_buf_recalc_constraints()
- helper.
- Returns recalculated constraints in recalc_cons, or error in the unlikely
- case when constraints of attached devices might have changed.
- */
Please see kerneldoc documentation for the proper format of these comments.
+static int recalc_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *recalc_cons)
+{
- struct device_dma_parameters calc_cons;
- struct dma_buf_attachment *attach;
- int ret = 0;
- init_constraints(&calc_cons);
- list_for_each_entry(attach, &dmabuf->attachments, node) {
ret = calc_constraints(attach->dev, &calc_cons);
if (ret)
return ret;
- }
- *recalc_cons = calc_cons;
- return 0;
+}
/**
- dma_buf_export_named - Creates a new dma_buf, and associates an anon file
- with this buffer, so it can be exported.
@@ -313,6 +373,9 @@ struct dma_buf *dma_buf_export_named(void *priv, const struct dma_buf_ops *ops, dmabuf->ops = ops; dmabuf->size = size; dmabuf->exp_name = exp_name;
- init_constraints(&dmabuf->constraints);
- init_waitqueue_head(&dmabuf->poll); dmabuf->cb_excl.poll = dmabuf->cb_shared.poll = &dmabuf->poll; dmabuf->cb_excl.active = dmabuf->cb_shared.active = 0;
@@ -422,7 +485,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev) { struct dma_buf_attachment *attach;
- int ret;
- int ret = 0;
if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL); @@ -436,6 +499,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, mutex_lock(&dmabuf->lock);
- if (calc_constraints(dev, &dmabuf->constraints))
goto err_constraints;
- if (dmabuf->ops->attach) { ret = dmabuf->ops->attach(dmabuf, dev, attach); if (ret)
@@ -448,6 +514,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, err_attach: kfree(attach); +err_constraints: mutex_unlock(&dmabuf->lock); return ERR_PTR(ret); } @@ -470,6 +537,8 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach) if (dmabuf->ops->detach) dmabuf->ops->detach(dmabuf, attach);
- recalc_constraints(dmabuf, &dmabuf->constraints);
To me, this whole thing seems horribly racy.
What happens if subsystem X creates a dmabuf, which is passed to userspace. It's then passed to subsystem Y, which starts making use of it, calling dma_buf_map_attachment() on it.
The same buffer is also passed (via unix domain sockets) to another program, which then passes it independently into subsystem Z, and subsystem Z has more restrictive DMA constraints.
What happens at this point?
Subsystems such as DRM cache the scatter table, and return it for subsequent attach calls, so DRM drivers using the default drm_gem_map_dma_buf() implementation would not see the restrictions placed upon the dmabuf. Moreover, the returned scatterlist would not be modified for those restrictions either.
What would other subsystems do?
This needs more thought before it's merged.
For example, in the above situation, should we deny the ability to create a new attachment when a dmabuf has already been mapped by an existing attachment? Should we deny it only when the new attachment has more restrictive DMA constraints?
Please consider the possible sequences of use (such as the scenario above) when creating or augmenting an API.
Hi Russell!
On 29 January 2015 at 20:09, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Tue, Jan 27, 2015 at 01:55:54PM +0530, Sumit Semwal wrote:
+/*
- recalc_constraints - recalculates constraints for all attached devices;
- useful for detach() recalculation, and for dma_buf_recalc_constraints()
- helper.
- Returns recalculated constraints in recalc_cons, or error in the unlikely
- case when constraints of attached devices might have changed.
- */
Thanks for your valuable review comments!
Please see kerneldoc documentation for the proper format of these comments.
These are static functions, and as such kerneldoc doesn't enforce kernel-doc style comments, so in the dma-buf files, we've not followed them for static functions. That said, it is certainly a valuable advice, and I could create a separate patch-set for updating the documentation for the static functions as well.
+static int recalc_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *recalc_cons)
+{
struct device_dma_parameters calc_cons;
struct dma_buf_attachment *attach;
int ret = 0;
init_constraints(&calc_cons);
list_for_each_entry(attach, &dmabuf->attachments, node) {
ret = calc_constraints(attach->dev, &calc_cons);
if (ret)
return ret;
}
*recalc_cons = calc_cons;
return 0;
+}
/**
- dma_buf_export_named - Creates a new dma_buf, and associates an anon file
- with this buffer, so it can be exported.
@@ -313,6 +373,9 @@ struct dma_buf *dma_buf_export_named(void *priv, const struct dma_buf_ops *ops, dmabuf->ops = ops; dmabuf->size = size; dmabuf->exp_name = exp_name;
init_constraints(&dmabuf->constraints);
init_waitqueue_head(&dmabuf->poll); dmabuf->cb_excl.poll = dmabuf->cb_shared.poll = &dmabuf->poll; dmabuf->cb_excl.active = dmabuf->cb_shared.active = 0;
@@ -422,7 +485,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev) { struct dma_buf_attachment *attach;
int ret;
int ret = 0; if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL);
@@ -436,6 +499,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
mutex_lock(&dmabuf->lock);
if (calc_constraints(dev, &dmabuf->constraints))
goto err_constraints;
if (dmabuf->ops->attach) { ret = dmabuf->ops->attach(dmabuf, dev, attach); if (ret)
@@ -448,6 +514,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf,
err_attach: kfree(attach); +err_constraints: mutex_unlock(&dmabuf->lock); return ERR_PTR(ret); } @@ -470,6 +537,8 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach) if (dmabuf->ops->detach) dmabuf->ops->detach(dmabuf, attach);
recalc_constraints(dmabuf, &dmabuf->constraints);
To me, this whole thing seems horribly racy.
What happens if subsystem X creates a dmabuf, which is passed to userspace. It's then passed to subsystem Y, which starts making use of it, calling dma_buf_map_attachment() on it.
The same buffer is also passed (via unix domain sockets) to another program, which then passes it independently into subsystem Z, and subsystem Z has more restrictive DMA constraints.
What happens at this point?
Subsystems such as DRM cache the scatter table, and return it for subsequent attach calls, so DRM drivers using the default drm_gem_map_dma_buf() implementation would not see the restrictions placed upon the dmabuf. Moreover, the returned scatterlist would not be modified for those restrictions either.
What would other subsystems do?
This needs more thought before it's merged.
For example, in the above situation, should we deny the ability to create a new attachment when a dmabuf has already been mapped by an existing attachment? Should we deny it only when the new attachment has more restrictive DMA constraints?
So, short answer is, it is left to the exporter to decide. The dma-buf framework should not even attempt to decide or enforce any of the above.
At each dma_buf_attach(), there's a callback to the exporter, where the exporter can decide, if it intends to handle these kind of cases, on the best way forward.
The exporter might, for example, decide to migrate backing storage, should there be a need to do so, or simply deny when the new attachment has more restrictive DMA constraints, as you mentioned as a possibility.
These changes simply allow the exporter, should it wish to, to take the DMA constraints into consideration while making those decisions. For the current cases, it should not even matter if the DMA constraints aren't shared by the devices.
Please consider the possible sequences of use (such as the scenario above) when creating or augmenting an API.
I tried to think of the scenarios I could think of, but If you still feel this approach doesn't help with your concerns, I'll graciously accept advice to improve it.
Once again, thanks for reviewing these changes!
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
Best regards, ~Sumit.
On Thu, Jan 29, 2015 at 09:00:11PM +0530, Sumit Semwal wrote:
So, short answer is, it is left to the exporter to decide. The dma-buf framework should not even attempt to decide or enforce any of the above.
At each dma_buf_attach(), there's a callback to the exporter, where the exporter can decide, if it intends to handle these kind of cases, on the best way forward.
The exporter might, for example, decide to migrate backing storage,
That's a decision which the exporter can not take. Think about it...
If subsystem Y has mapped the buffer, it could be accessing the buffer's backing storage at the same time that subsystem Z tries to attach to the buffer.
Once the buffer has been exported to another user, the exporter has effectively lost control over mediating accesses to that buffer.
All that it can do with the way the dma-buf API is today is to allocate a _different_ scatter list pointing at the same backing storage which satisfies the segment size and number of segments, etc.
There's also another issue which you haven't addressed. What if several attachments result in lowering max_segment_size and max_segment_count such that:
max_segment_size * max_segment_count < dmabuf->size
but individually, the attachments allow dmabuf->size to be represented as a scatterlist?
If an exporter were to take notice of the max_segment_size and max_segment_count, the resulting buffer is basically unrepresentable as a scatterlist.
Please consider the possible sequences of use (such as the scenario above) when creating or augmenting an API.
I tried to think of the scenarios I could think of, but If you still feel this approach doesn't help with your concerns, I'll graciously accept advice to improve it.
See the new one above :)
On 29 January 2015 at 21:17, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Thu, Jan 29, 2015 at 09:00:11PM +0530, Sumit Semwal wrote:
So, short answer is, it is left to the exporter to decide. The dma-buf framework should not even attempt to decide or enforce any of the above.
At each dma_buf_attach(), there's a callback to the exporter, where the exporter can decide, if it intends to handle these kind of cases, on the best way forward.
The exporter might, for example, decide to migrate backing storage,
That's a decision which the exporter can not take. Think about it...
If subsystem Y has mapped the buffer, it could be accessing the buffer's backing storage at the same time that subsystem Z tries to attach to the buffer.
Well, first up, of course the 'migration of backing storage' is an orthogonal problem to what this patchset attempts to do - in this, I am only try to make the relevant information available to the exporter.
With that out of the way, some thoughts on what you mentioned:
So, IF the exporter needs to support migration of backing storage, even when subsystem Y has mapped the buffer, the exporter knows this (because of the map_dma_buf() dma_buf_op) - and the attach() also is notified to / handled by the exporter. With this information, it could either: a) not let the subsystem Z attach (the 'simpler' approach), or b) hold enough state-information about the Z's attach request internally, then migrate the pages on the unmap_attachment() callback from the subsystem Y?
(The exact details for this will need to be thought-of by exporters actually trying to do migration of pages, or delayed allocation, or such, though)
Once the buffer has been exported to another user, the exporter has effectively lost control over mediating accesses to that buffer.
All that it can do with the way the dma-buf API is today is to allocate a _different_ scatter list pointing at the same backing storage which satisfies the segment size and number of segments, etc.
There's also another issue which you haven't addressed. What if several attachments result in lowering max_segment_size and max_segment_count such that:
max_segment_size * max_segment_count < dmabuf->size
but individually, the attachments allow dmabuf->size to be represented as a scatterlist?
If an exporter were to take notice of the max_segment_size and max_segment_count, the resulting buffer is basically unrepresentable as a scatterlist.
Thanks for pointing that out; I guess we'd have to disallow the attachment which would make that happen. I can add this as another check in calc_constraints().
Please consider the possible sequences of use (such as the scenario above) when creating or augmenting an API.
I tried to think of the scenarios I could think of, but If you still feel this approach doesn't help with your concerns, I'll graciously accept advice to improve it.
See the new one above :)
Another thanks for making me rack my puny brain on these scenarios! :) [though I strongly suspect I might not have done enough!]
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
BR, ~Sumit.
On Thu, Jan 29, 2015 at 10:47 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Thu, Jan 29, 2015 at 09:00:11PM +0530, Sumit Semwal wrote:
So, short answer is, it is left to the exporter to decide. The dma-buf framework should not even attempt to decide or enforce any of the above.
At each dma_buf_attach(), there's a callback to the exporter, where the exporter can decide, if it intends to handle these kind of cases, on the best way forward.
The exporter might, for example, decide to migrate backing storage,
That's a decision which the exporter can not take. Think about it...
If subsystem Y has mapped the buffer, it could be accessing the buffer's backing storage at the same time that subsystem Z tries to attach to the buffer.
The *theory* is that Y is map/unmap'ing the buffer around each use, so there will be some point where things could be migrated and remapped.. in practice, I am not sure that anyone is doing this yet.
Probably it would be reasonable if a more restrictive subsystem tried to attach after the buffer was already allocated and mapped in a way that don't meet the new constraints, then -EBUSY.
But from a quick look it seems like there needs to be a slight fixup to not return 0 if calc_constraints() fails..
Once the buffer has been exported to another user, the exporter has effectively lost control over mediating accesses to that buffer.
All that it can do with the way the dma-buf API is today is to allocate a _different_ scatter list pointing at the same backing storage which satisfies the segment size and number of segments, etc.
There's also another issue which you haven't addressed. What if several attachments result in lowering max_segment_size and max_segment_count such that:
max_segment_size * max_segment_count < dmabuf->size
but individually, the attachments allow dmabuf->size to be represented as a scatterlist?
Quite possibly for some of these edge some of cases, some of the dma-buf exporters are going to need to get more clever (ie. hand off different scatterlists to different clients). Although I think by far the two common cases will be "I can support anything via an iommu/mmu" and "I need phys contig".
But that isn't an issue w/ dma-buf itself, so much as it is an issue w/ drivers. I guess there would be more interest in fixing up drivers when actual hw comes along that needs it..
BR, -R
If an exporter were to take notice of the max_segment_size and max_segment_count, the resulting buffer is basically unrepresentable as a scatterlist.
Please consider the possible sequences of use (such as the scenario above) when creating or augmenting an API.
I tried to think of the scenarios I could think of, but If you still feel this approach doesn't help with your concerns, I'll graciously accept advice to improve it.
See the new one above :)
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Thu, Jan 29, 2015 at 01:52:09PM -0500, Rob Clark wrote:
Quite possibly for some of these edge some of cases, some of the dma-buf exporters are going to need to get more clever (ie. hand off different scatterlists to different clients). Although I think by far the two common cases will be "I can support anything via an iommu/mmu" and "I need phys contig".
But that isn't an issue w/ dma-buf itself, so much as it is an issue w/ drivers. I guess there would be more interest in fixing up drivers when actual hw comes along that needs it..
However, validating the attachments is the business of dma-buf. This is actual infrastructure, which should ensure some kind of sanity such as the issues I've raised.
The whole "we can push it onto our users" is really on - what that results in is the users ignoring most of the requirements and just doing their own thing, which ultimately ends up with the whole thing turning into a disgusting mess - one which becomes very difficult to fix later.
Now, if we're going to do the "more clever" thing you mention above, that rather negates the point of this two-part patch set, which is to provide the union of the DMA capabilities of all users. A union in that case is no longer sane as we'd be tailoring the SG lists to each user.
If we aren't going to do the "more clever" thing, then yes, we need this code to calculate that union, but we _also_ need it to do sanity checking right from the start, and refuse conditions which ultimately break the ability to make use of that union - in other words, when the union of the DMA capabilities means that the dmabuf can't be represented.
Unless we do that, we'll just end up with random drivers interpreting what they want from the DMA capabilities, and we'll have some drivers exporting (eg) scatterlists which satisfy the maximum byte size of an element, but ignoring the maximum number of entries or vice versa, and that'll most probably hide the case of "too small a union".
It really doesn't make sense to do both either: that route is even more madness, because we'll end up with two classes of drivers - those which use the union approach, and those which don't.
The KISS principle applies here.
On Thu, Jan 29, 2015 at 2:26 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Thu, Jan 29, 2015 at 01:52:09PM -0500, Rob Clark wrote:
Quite possibly for some of these edge some of cases, some of the dma-buf exporters are going to need to get more clever (ie. hand off different scatterlists to different clients). Although I think by far the two common cases will be "I can support anything via an iommu/mmu" and "I need phys contig".
But that isn't an issue w/ dma-buf itself, so much as it is an issue w/ drivers. I guess there would be more interest in fixing up drivers when actual hw comes along that needs it..
However, validating the attachments is the business of dma-buf. This is actual infrastructure, which should ensure some kind of sanity such as the issues I've raised.
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That all said, I think probably all the existing exporters cache the sg-list. And I can't think of any actual hw which would hit this problem that can be solved by multiple sg-lists for the same physical memory. (And the constraint calculation kind of assumes the end result will be a single sg-list.) So it seems reasonable to me to check that max_segment_count * max_segment_size is not smaller than the buffer.
If it was a less theoretical problem, I think I'd more inclined for a way that the exporter could override the checks, or something along those lines.
otoh, if the attachment is just not possible because the buffer has been already allocated and mapped by someone with more relaxed constraints.. then I think the driver should be the one returning the error since dma-buf doesn't know this.
The whole "we can push it onto our users" is really on - what that results in is the users ignoring most of the requirements and just doing their own thing, which ultimately ends up with the whole thing turning into a disgusting mess - one which becomes very difficult to fix later.
Ideally at some point, dma-mapping or some helpers would support allocations matching constraints.. I think only actual gpu drivers want to do crazy enough things that they'd want to bypass dma-mapping. If everyone else can use dma-mapping and/or helpers then we make it harder for drivers to do the wrong thing than to do the right thing.
Now, if we're going to do the "more clever" thing you mention above, that rather negates the point of this two-part patch set, which is to provide the union of the DMA capabilities of all users. A union in that case is no longer sane as we'd be tailoring the SG lists to each user.
It doesn't really negate.. a different sg list representing the same physical memory cannot suddenly make the buffer physically contiguous (from the perspective of memory)..
(unless we are not on the same page here, so to speak)
BR, -R
If we aren't going to do the "more clever" thing, then yes, we need this code to calculate that union, but we _also_ need it to do sanity checking right from the start, and refuse conditions which ultimately break the ability to make use of that union - in other words, when the union of the DMA capabilities means that the dmabuf can't be represented.
Unless we do that, we'll just end up with random drivers interpreting what they want from the DMA capabilities, and we'll have some drivers exporting (eg) scatterlists which satisfy the maximum byte size of an element, but ignoring the maximum number of entries or vice versa, and that'll most probably hide the case of "too small a union".
It really doesn't make sense to do both either: that route is even more madness, because we'll end up with two classes of drivers - those which use the union approach, and those which don't.
The KISS principle applies here.
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Thu, Jan 29, 2015 at 05:18:33PM -0500, Rob Clark wrote:
On Thu, Jan 29, 2015 at 2:26 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
Now, if we're going to do the "more clever" thing you mention above, that rather negates the point of this two-part patch set, which is to provide the union of the DMA capabilities of all users. A union in that case is no longer sane as we'd be tailoring the SG lists to each user.
It doesn't really negate.. a different sg list representing the same physical memory cannot suddenly make the buffer physically contiguous (from the perspective of memory)..
(unless we are not on the same page here, so to speak)
If we are really only interested in the "physically contiguous" vs "scattered" differentiation, why can't this be just a simple flag?
I think I know where you're coming from on that distinction - most GPUs can cope with their buffers being discontiguous in memory, but scanout and capture hardware tends to need contiguous buffers.
My guess is that you're looking for some way that a GPU driver could allocate a buffer, which can then be imported into the scanout hardware - and when it is, the underlying backing store is converted to a contiguous buffer. Is that the usage scenario you're thinking of?
On Thu, Jan 29, 2015 at 5:31 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Thu, Jan 29, 2015 at 05:18:33PM -0500, Rob Clark wrote:
On Thu, Jan 29, 2015 at 2:26 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
Now, if we're going to do the "more clever" thing you mention above, that rather negates the point of this two-part patch set, which is to provide the union of the DMA capabilities of all users. A union in that case is no longer sane as we'd be tailoring the SG lists to each user.
It doesn't really negate.. a different sg list representing the same physical memory cannot suddenly make the buffer physically contiguous (from the perspective of memory)..
(unless we are not on the same page here, so to speak)
If we are really only interested in the "physically contiguous" vs "scattered" differentiation, why can't this be just a simple flag?
I'd be fine with that.. I was trying to make it a bit less of a point solution, but maybe trying to be too generic is not worth it..
There is apparently some hw which has iommu's but small # of tlb entries, and would prefer partially contiguous buffers. But that isn't a hard constraint, and maybe shouldn't be solved w/ max_segment_count. And I'm not sure how common that is.
I think I know where you're coming from on that distinction - most GPUs can cope with their buffers being discontiguous in memory, but scanout and capture hardware tends to need contiguous buffers.
My guess is that you're looking for some way that a GPU driver could allocate a buffer, which can then be imported into the scanout hardware - and when it is, the underlying backing store is converted to a contiguous buffer. Is that the usage scenario you're thinking of?
Pretty much.. and maybe a few slight permutations on that involving cameras / video codecs / etc. But the really-really common case is gpu (with mmu/iommu) + display (without). Just solving this problem would be a really good first step.
BR, -R
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Thu, Jan 29, 2015 at 05:18:33PM -0500, Rob Clark wrote:
On Thu, Jan 29, 2015 at 2:26 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Thu, Jan 29, 2015 at 01:52:09PM -0500, Rob Clark wrote:
Quite possibly for some of these edge some of cases, some of the dma-buf exporters are going to need to get more clever (ie. hand off different scatterlists to different clients). Although I think by far the two common cases will be "I can support anything via an iommu/mmu" and "I need phys contig".
But that isn't an issue w/ dma-buf itself, so much as it is an issue w/ drivers. I guess there would be more interest in fixing up drivers when actual hw comes along that needs it..
However, validating the attachments is the business of dma-buf. This is actual infrastructure, which should ensure some kind of sanity such as the issues I've raised.
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
That all said, I think probably all the existing exporters cache the sg-list. And I can't think of any actual hw which would hit this problem that can be solved by multiple sg-lists for the same physical memory. (And the constraint calculation kind of assumes the end result will be a single sg-list.) So it seems reasonable to me to check that max_segment_count * max_segment_size is not smaller than the buffer.
If it was a less theoretical problem, I think I'd more inclined for a way that the exporter could override the checks, or something along those lines.
otoh, if the attachment is just not possible because the buffer has been already allocated and mapped by someone with more relaxed constraints.. then I think the driver should be the one returning the error since dma-buf doesn't know this.
Importers currently cache the mapped sg list aggressively (i915) or outright pin it for as long as possible (everyone else). So any kind of moving stuff around is pretty much impossible with current drivers.
The even worse violation of the dma-buf spec is that all the ttm drivers don't use the sg table correctly at all. They assume that each physical page has exactly one sg table entry, and then fish out the struct page * pointer from that to build up their own bo management stuff and ignore everything else.
The whole "we can push it onto our users" is really on - what that results in is the users ignoring most of the requirements and just doing their own thing, which ultimately ends up with the whole thing turning into a disgusting mess - one which becomes very difficult to fix later.
Ideally at some point, dma-mapping or some helpers would support allocations matching constraints.. I think only actual gpu drivers want to do crazy enough things that they'd want to bypass dma-mapping. If everyone else can use dma-mapping and/or helpers then we make it harder for drivers to do the wrong thing than to do the right thing.
Now, if we're going to do the "more clever" thing you mention above, that rather negates the point of this two-part patch set, which is to provide the union of the DMA capabilities of all users. A union in that case is no longer sane as we'd be tailoring the SG lists to each user.
It doesn't really negate.. a different sg list representing the same physical memory cannot suddenly make the buffer physically contiguous (from the perspective of memory)..
(unless we are not on the same page here, so to speak)
Or someone was not chip and put a decent iommu in front of the same IP block. E.g. the raspi gpu needs contiguous memory for rendering, but the same block is used elsewhere but then with an iommu.
But thinking about all this I wonder whether we really should start with some kind of constraint solving. It feels fairly leaky compared to the encapsulation the dma api provides, and so isn't really better for upstream than just using ion (which completely gives up on this problem and relies on userspace allocating correct buffers).
And if we step away for a bit there's already a bunch of things that the current dma api fails at, and which is just a bit a worse problem with dma-buf sharing: There's not really a generic way to map an sg table zero-copy, i.e. there's no generic way to avoid bounce buffers. And that's already hurting all the existing gpu drivers: ttm essentially does page-sized allocs for everything and then has it's own dma allocator on top of that page-cache. i915 has some other hacks to at least not fail the bounce buffer allocator too badly. Grepping for SWIOTLB in drm is fairly interesting.
So imo if our goal is to keep the abstraction provided by the dma api somewhat intact we should first figure out to map an sg table without copying any data. If we have that any exporter can then easily check whether an attachment works out by test-mapping stuff. A bit inefficient, but at least possible (and exporters could cache the mapped sg table if so desired). And we could rip out a pile of hacks from drm drivers.
The other problem is is coherency management. Even in the single-device usecase current dma-buf isn't up to things since on many socs the same device can use both coherent and non-coherent transactions. And if you map the same memory multiple times we don't really want to flush cpu caches multiple times (ppl alreay have massive caches and stuff to avoid the cpu cache flush when there's just one device using the buffer object). Again this probably needs some core dma api extensions. And again we could probably throw away a bunch of driver code (at least in i915, unfortunately there's no intel v4l driver to test non-coherent sharing on x86).
With all that resolved somehow (and all these issues already affect single-device dma api usage by gpu drivers) the bit left would be figuring out where to allocate things. And I'm not even sure whether we should really bother to implement this in the kernel (no matter how I slice it it always looks like we're leaking the dma api abstractions) but just with a few tries in userspace:
- Allocate shared buffers from the scanout engine (or v4l, though that doesn't yet support exporting).
- If you can't import, try it the other way round. This isn't perfect for cross-process + cross-device bo sharing, but then pretty much all compositors have a gpu rendering fallback anyway because this problem is a bit too complex to solve perfectly. Of course for the next frame compositor can provide a new buffer which hopefully works out better.
- Userspace just knows where to allocate. Imo that's not actually unreasonable since if you really have that tricky requirements you probably also have insane buffer layouts and then all bets for generic code are off anyway.
Cheers, Daniel
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
BR, -R
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
That's incorrect - and goes dead against the design of scatterlists.
Not only that, but it is entirely possible that you may get handed memory via dmabufs for which there are no struct page's associated with that memory - think about display systems which have their own video memory which is accessible to the GPU, but it isn't system memory.
In those circumstances, you have to use the dma_addr_t's and not the pages.
On Mon, Feb 2, 2015 at 4:46 PM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
That's incorrect - and goes dead against the design of scatterlists.
yeah, a bit of an abuse, although I'm not sure I see a much better way when a device vaddr depends on user context..
Not only that, but it is entirely possible that you may get handed memory via dmabufs for which there are no struct page's associated with that memory - think about display systems which have their own video memory which is accessible to the GPU, but it isn't system memory.
well, I guess anyways when it comes to sharing buffers, it won't be the vram placement of the bo that gets shared ;-)
BR, -R
In those circumstances, you have to use the dma_addr_t's and not the pages.
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Mon, Feb 02, 2015 at 05:36:10PM -0500, Rob Clark wrote:
well, I guess anyways when it comes to sharing buffers, it won't be the vram placement of the bo that gets shared ;-)
Actually that's pretty much what I'd like to be able to do for i915. Except that vram is just a specially protected chunk of main memory. -Daniel
On Mon, Feb 02, 2015 at 09:46:16PM +0000, Russell King - ARM Linux wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
That's incorrect - and goes dead against the design of scatterlists.
Not only that, but it is entirely possible that you may get handed memory via dmabufs for which there are no struct page's associated with that memory - think about display systems which have their own video memory which is accessible to the GPU, but it isn't system memory.
In those circumstances, you have to use the dma_addr_t's and not the pages.
Yeah exactly. At least with i915 we'd really want to be able to share stolen memory in some cases, and since that's stolen there's no struct pages for them. On top of that any cpu access is also blocked to that range in the memory controller, so the dma_addr_t is really the _only_ thing you can use to get at those memory ranges. And afaik the camera pipe on intel soc can get there - unfortunately that one doesn't have an upstream driver :(
And just to clarify: All current dma-buf exporter that I've seen implement the sg mapping correctly and _do_ map the sg table into device address space with dma_map_sg. In other words: The dma_addr_t are all valid, it's just that e.g. with ttm no one has bothered to teach ttm a dma-buf correctly. The internal abstraction is all there, ttm-internal buffer object interface match what dma-buf exposes fairly closes (hey I didn't do shit when designing those interfaces ;-) -Daniel
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
Hm, what does per-process/context page tables have to do here? At least on i915 we have a two levels of page tables: - first level for vm/device isolation, used through dma api - 2nd level for per-gpu-context isolation and context switching, handled internally.
Since atm the dma api doesn't have any context of contexts or different pagetables, I don't see who you could use that at all. -Daniel
On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
Hm, what does per-process/context page tables have to do here? At least on i915 we have a two levels of page tables:
- first level for vm/device isolation, used through dma api
- 2nd level for per-gpu-context isolation and context switching, handled internally.
Since atm the dma api doesn't have any context of contexts or different pagetables, I don't see who you could use that at all.
What I've found with *my* etnaviv drm implementation (not Christian's - I found it impossible to work with Christian, especially with the endless "msm doesn't do it that way, so we shouldn't" responses and his attitude towards cherry-picking my development work [*]) is that it's much easier to keep the GPU MMU local to the GPU and under the control of the DRM MM code, rather than attaching the IOMMU to the DMA API and handling it that way.
There are several reasons for that:
1. DRM has a better idea about when the memory needs to be mapped to the GPU, and it can more effectively manage the GPU MMU.
2. The GPU MMU may have TLBs which can only be flushed via a command in the GPU command stream, so it's fundamentally necessary for the MMU to be managed by the GPU driver so that it knows when (and how) to insert the flushes.
* - as a direct result of that, I've stopped all further development of etnaviv drm, and I'm intending to strip it out from my Xorg DDX driver as the etnaviv drm API which Christian wants is completely incompatible with the non-etnaviv drm, and that just creates far too much pain in the DDX driver.
On Tue, Feb 03, 2015 at 12:28:14PM +0000, Russell King - ARM Linux wrote:
On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
Hm, what does per-process/context page tables have to do here? At least on i915 we have a two levels of page tables:
- first level for vm/device isolation, used through dma api
- 2nd level for per-gpu-context isolation and context switching, handled internally.
Since atm the dma api doesn't have any context of contexts or different pagetables, I don't see who you could use that at all.
What I've found with *my* etnaviv drm implementation (not Christian's - I found it impossible to work with Christian, especially with the endless "msm doesn't do it that way, so we shouldn't" responses and his attitude towards cherry-picking my development work [*]) is that it's much easier to keep the GPU MMU local to the GPU and under the control of the DRM MM code, rather than attaching the IOMMU to the DMA API and handling it that way.
There are several reasons for that:
DRM has a better idea about when the memory needs to be mapped to the GPU, and it can more effectively manage the GPU MMU.
The GPU MMU may have TLBs which can only be flushed via a command in the GPU command stream, so it's fundamentally necessary for the MMU to be managed by the GPU driver so that it knows when (and how) to insert the flushes.
3. Switching between different address spaces (for per gpu context isolation) often requires intricate knowledge about the gpu and close coordination. Well maybe just a part of 2 really, but an important one.
Fully agree and tbh I'm not sure whether the current push in arm to expose all gpu mmus as iommus is solid. Even for pasid (per-context iommu tables) which is a big official standard there's still a lot of open questions about how to do it properly. And it requires strict hw support so that the hw always knows which pasid it should use for a given dma access. -Daniel
2015-02-03 13:28 GMT+01:00 Russell King - ARM Linux linux@arm.linux.org.uk:
On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
Hm, what does per-process/context page tables have to do here? At least on i915 we have a two levels of page tables:
- first level for vm/device isolation, used through dma api
- 2nd level for per-gpu-context isolation and context switching, handled internally.
Since atm the dma api doesn't have any context of contexts or different pagetables, I don't see who you could use that at all.
What I've found with *my* etnaviv drm implementation (not Christian's - I found it impossible to work with Christian, especially with the endless "msm doesn't do it that way, so we shouldn't" responses and his attitude towards cherry-picking my development work [*]) is that it's much easier to keep the GPU MMU local to the GPU and under the control of the DRM MM code, rather than attaching the IOMMU to the DMA API and handling it that way.
Keep in mind that I tried to reach you several times via mail and irc and you simply ignored me. Did you know that took almost all of your patches (with small changes)? And I needed to cherry pick you patches as they were a) wrong, b) solved in a different way or c) had "hack" in the subject. I am quite sorry that I ended that way, but it is not only my fault!
There are several reasons for that:
DRM has a better idea about when the memory needs to be mapped to the GPU, and it can more effectively manage the GPU MMU.
The GPU MMU may have TLBs which can only be flushed via a command in the GPU command stream, so it's fundamentally necessary for the MMU to be managed by the GPU driver so that it knows when (and how) to insert the flushes.
- as a direct result of that, I've stopped all further development of
etnaviv drm, and I'm intending to strip it out from my Xorg DDX driver as the etnaviv drm API which Christian wants is completely incompatible with the non-etnaviv drm, and that just creates far too much pain in the DDX driver.
That is bad, but life moves on.
greets -- Christian Gmeiner, MSc
On Tue, Feb 03, 2015 at 02:28:26PM +0100, Christian Gmeiner wrote:
2015-02-03 13:28 GMT+01:00 Russell King - ARM Linux linux@arm.linux.org.uk:
What I've found with *my* etnaviv drm implementation (not Christian's - I found it impossible to work with Christian, especially with the endless "msm doesn't do it that way, so we shouldn't" responses and his attitude towards cherry-picking my development work [*]) is that it's much easier to keep the GPU MMU local to the GPU and under the control of the DRM MM code, rather than attaching the IOMMU to the DMA API and handling it that way.
Keep in mind that I tried to reach you several times via mail and irc and you simply ignored me. Did you know that took almost all of your patches (with small changes)? And I needed to cherry pick you patches as they were a) wrong, b) solved in a different way or c) had "hack" in the subject. I am quite sorry that I ended that way, but it is not only my fault!
Exactly - you *took* every patch that I published whether I was ready for you to take it or not.
That's not how kernel development works. Kernel development works by people working on the code, and *pushing* patches or git trees upstream. It doesn't work by having people running around *taking* patches from people just because they feel like it.
I asked you several times not to do that which means the only way I can control you is by *not* publishing my changes, thereby denying other people the ability to test my changes.
Another result of you *taking* patches from me is that you totally *screwed* my ability to work with you. If you make this stuff unnecessary hard, you can expect people to walk away, and that's precisely what I've done.
There's also the issue of you *taking* my patches and then applying them to your tree with your own modifications, again *screwing* my tree, and screwing my ability - again - to work with you.
Many of my patches in your repository are also marked as you being the author of them... which _really_ is not nice.
Your "review" comments of "based 1:1 on the MSM driver" were really crazy. Just because one DRM driver does something one way does not make it the only way, nor does it make it suitable for use everywhere, even if you modelled your driver on MSM. It certainly doesn't mean that the way the MSM driver does it is correct either.
And frankly, you calling my patches "wrong" is laughable. I have a stable fully functional Xorg DDX driver here which works with my version of your etnaviv DRM across several Vivante GPUs - GC660 on Dove, and two revisions of GC320 on iMX6 (which are notoriously buggy). No kernel oopses, no GPU lockups. I've had machines with uptimes of a month here with it, with the driver being exercised frequently during that period.
You refused to take things such as the DMA address monitoring for GPU hangups to stop it mis-firing. This _is_ a bug fix. Without that, your driver spits out random GPU hangups when there isn't actually any hangup at all. *Reliably* so. Your excuse for not taking it was "The current vivante driver doesn't do that." While that's true, the V2 Vivante drivers _do_ do it in their "guard thread" - and they do it because - as I already explained to you - it's entirely possible for the GPU to take a long time, longer than your hangcheck timeout, to render a series of 1080p operations. And again, not everything that the Vivante drivers do is correct either. Jon and myself know that very well having spent a long time debugging their GPL'd offerings.
Even the "hack" patch was mostly correct - the reason that it is labelled as a "hack" is because - as the commit log said - it should really be done by the MMUv1 code, but that requires your entire IOMMU _bodge_ to be rewritten properly. Even the Vivante drivers use that region when they can.
Then there's also the compatibility with the etnaviv library - which is an important thing if you want people to make use of your driver. You applied the patches for and then reverted which completely screws the Xorg DDX driver, making it impossible to support both etnaviv and etnadrm without having two almost identical copies of the same code. I don't want to maintain almost identical copies of that same code, and no one in their right mind would want that.
Having some level of sane user compatibility between etnaviv and etnaviv drm will _gain_ you uses as it will allow people to write code which works on both platforms - and it's really not difficult to do. (In fact, I've proven it by writing a shim layer between the etnaviv API and the DRM interfaces in the DDX driver.)
Then there's the round-robin IOMMU address space allocation, which is required to avoid corrupted pixmaps (which is something that _your_ driver does very very well - in fact, so well that it corrupts system memory), and the reaping of the IOMMU space when we run out of IOMMU space to map.
Now, on to things that you do wrong.
There's your bodge with the component helper too which I'm disgusted with, and I'll tell you now - your use of the component helper is wrong.
In your latest repository, there's this reserved-memory thing which you've invented - to work around iMX6 with 2GB of RAM allocating the command ring buffer at physical addresses >= 0x80000000. That's absolutely not necessary, the GPU has physical offset registers which can be used to program the lower 2G of MMUv1 space at the appropriate offset - and there's good reasons to do that - it prevents the GPU being able to access 0x00000000-0x10000000 which are where the peripheral registers are on the iMX6 - it prevents the GPU being able to scribble into (eg) the SDRAM controller registers etc potentially taking the system down. Yes, it won't work if we see 3G on iMX6, but that's something which can be addressed in other ways (such as passing appropriate GFP_* flags.)
Here's a reminder of the kind of "high quality" review comments you provided:
* staging: etnaviv: ensure cleanup of reservation object commit 729b31a8dd07d5db744cc5cb32c28ebf2e8cadb5
This is based 1:1 on msm drm driver. I will talk with Rob about it and will postpone this patch.
* staging: etnaviv: implement MMU reaping commit 0c7f0736cc02ba83dea15d73e9fa98277839ca67
I will _NOT_ take this one.
* staging: etnaviv: stop the hangcheck timer mis-firing commit f0c3d8f2bf81774aaf19284ec3287e343a772e63
Too hacky, current vivante drivers do not have this hack. I will _NOT_ take this one.
* staging: etnaviv: increase iommu page table size to 512KiB commit 610da4c465732f1c20495f8d7d9504ec25665bb0
I will _NOT_ take this one. I will postpone it.
* staging: etnaviv: fix fence wrapping for gem objects commit 218e8ffce1d57a797e7f3234cd8ffd62fc4dab71
This is based 1:1 on msm drm driver. I will talk with Rob about it and will postpone this patch.
On Tue, Feb 3, 2015 at 7:28 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Tue, Feb 03, 2015 at 08:48:56AM +0100, Daniel Vetter wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
Hm, what does per-process/context page tables have to do here? At least on i915 we have a two levels of page tables:
- first level for vm/device isolation, used through dma api
- 2nd level for per-gpu-context isolation and context switching, handled internally.
Since atm the dma api doesn't have any context of contexts or different pagetables, I don't see who you could use that at all.
What I've found with *my* etnaviv drm implementation (not Christian's - I found it impossible to work with Christian, especially with the endless "msm doesn't do it that way, so we shouldn't" responses and his attitude towards cherry-picking my development work [*]) is that it's much easier to keep the GPU MMU local to the GPU and under the control of the DRM MM code, rather than attaching the IOMMU to the DMA API and handling it that way.
There are several reasons for that:
DRM has a better idea about when the memory needs to be mapped to the GPU, and it can more effectively manage the GPU MMU.
The GPU MMU may have TLBs which can only be flushed via a command in the GPU command stream, so it's fundamentally necessary for the MMU to be managed by the GPU driver so that it knows when (and how) to insert the flushes.
If gpu mmu needs some/all updates to happen from command-stream then probably better to handle it internally..
That is a slightly different scenario from msm, where we have many instances of the same iommu[*] scattered through the SoC in front of various different devices.
BR, -R
[*] at least from iommu register layout, same driver is used for all instances.. but maybe the tlb+walker are maybe more tightly integrated to the gpu, but that is just speculation on implementation details based on some paper I found along the way
- as a direct result of that, I've stopped all further development of
etnaviv drm, and I'm intending to strip it out from my Xorg DDX driver as the etnaviv drm API which Christian wants is completely incompatible with the non-etnaviv drm, and that just creates far too much pain in the DDX driver.
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Tue, Feb 3, 2015 at 2:48 AM, Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
Hm, what does per-process/context page tables have to do here? At least on i915 we have a two levels of page tables:
- first level for vm/device isolation, used through dma api
- 2nd level for per-gpu-context isolation and context switching, handled internally.
Since atm the dma api doesn't have any context of contexts or different pagetables, I don't see who you could use that at all.
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
BR, -R
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
On Tue, Feb 3, 2015 at 2:48 AM, Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Feb 02, 2015 at 03:30:21PM -0500, Rob Clark wrote:
On Mon, Feb 2, 2015 at 11:54 AM, Daniel Vetter daniel@ffwll.ch wrote:
My initial thought is for dma-buf to not try to prevent something than an exporter can actually do.. I think the scenario you describe could be handled by two sg-lists, if the exporter was clever enough.
That's already needed, each attachment has it's own sg-list. After all there's no array of dma_addr_t in the sg tables, so you can't use one sg for more than one mapping. And due to different iommu different devices can easily end up with different addresses.
Well, to be fair it may not be explicitly stated, but currently one should assume the dma_addr_t's in the dmabuf sglist are bogus. With gpu's that implement per-process/context page tables, I'm not really sure that there is a sane way to actually do anything else..
Hm, what does per-process/context page tables have to do here? At least on i915 we have a two levels of page tables:
- first level for vm/device isolation, used through dma api
- 2nd level for per-gpu-context isolation and context switching, handled internally.
Since atm the dma api doesn't have any context of contexts or different pagetables, I don't see who you could use that at all.
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
I think what you see here is a quite common hardware setup and we really lack the right abstraction for it at the moment. Everybody seems to work around it with a mix of the dma-mapping API and the iommu API. These are doing different things, and even though the dma-mapping API can be implemented on top of the iommu API, they are not really compatible.
The drm_clflush helpers don't seem like the right solution to me, because all other devices outside of drm will face the same issue, and I suspect we should fill the missing gaps in the API in a more generic way.
Arnd
On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
I think what you see here is a quite common hardware setup and we really lack the right abstraction for it at the moment. Everybody seems to work around it with a mix of the dma-mapping API and the iommu API. These are doing different things, and even though the dma-mapping API can be implemented on top of the iommu API, they are not really compatible.
I'd go as far as saying that the "DMA API on top of IOMMU" is more intended to be for a system IOMMU for the bus in question, rather than a device-level IOMMU.
If an IOMMU is part of a device, then the device should handle it (maybe via an abstraction) and not via the DMA API. The DMA API should be handing the bus addresses to the device driver which the device's IOMMU would need to generate. (In other words, in this circumstance, the DMA API shouldn't give you the device internal address.)
On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
I think what you see here is a quite common hardware setup and we really lack the right abstraction for it at the moment. Everybody seems to work around it with a mix of the dma-mapping API and the iommu API. These are doing different things, and even though the dma-mapping API can be implemented on top of the iommu API, they are not really compatible.
I'd go as far as saying that the "DMA API on top of IOMMU" is more intended to be for a system IOMMU for the bus in question, rather than a device-level IOMMU.
If an IOMMU is part of a device, then the device should handle it (maybe via an abstraction) and not via the DMA API. The DMA API should be handing the bus addresses to the device driver which the device's IOMMU would need to generate. (In other words, in this circumstance, the DMA API shouldn't give you the device internal address.)
Exactly. And the abstraction that people choose at the moment is the iommu API, for better or worse. It makes a lot of sense to use this API if the same iommu is used for other devices as well (which is the case on Tegra and probably a lot of others). Unfortunately the iommu API lacks support for cache management, and probably other things as well, because this was not an issue for the original use case (device assignment on KVM/x86).
This could be done by adding explicit or implied cache management to the IOMMU mapping interfaces, or by extending the dma-mapping interfaces in a way that covers the use case of the device managing its own address space, in addition to the existing coherent and streaming interfaces.
Arnd
On Tue, Feb 03, 2015 at 03:52:48PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
I'd go as far as saying that the "DMA API on top of IOMMU" is more intended to be for a system IOMMU for the bus in question, rather than a device-level IOMMU.
If an IOMMU is part of a device, then the device should handle it (maybe via an abstraction) and not via the DMA API. The DMA API should be handing the bus addresses to the device driver which the device's IOMMU would need to generate. (In other words, in this circumstance, the DMA API shouldn't give you the device internal address.)
Exactly. And the abstraction that people choose at the moment is the iommu API, for better or worse. It makes a lot of sense to use this API if the same iommu is used for other devices as well (which is the case on Tegra and probably a lot of others). Unfortunately the iommu API lacks support for cache management, and probably other things as well, because this was not an issue for the original use case (device assignment on KVM/x86).
This could be done by adding explicit or implied cache management to the IOMMU mapping interfaces, or by extending the dma-mapping interfaces in a way that covers the use case of the device managing its own address space, in addition to the existing coherent and streaming interfaces.
Don't we already have those in the DMA API? dma_sync_*() ?
dma_map_sg() - sets up the system MMU and deals with initial cache coherency handling. Device IOMMU being the responsibility of the GPU driver.
The GPU can then do dma_sync_*() on the scatterlist as is necessary to synchronise the cache coherency (while respecting the ownership rules - which are very important on ARM to follow as some sync()s are destructive to any dirty data in the CPU cache.)
dma_unmap_sg() tears down the system MMU and deals with the final cache handling.
Why do we need more DMA API interfaces?
On Tuesday 03 February 2015 15:22:05 Russell King - ARM Linux wrote:
On Tue, Feb 03, 2015 at 03:52:48PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
I'd go as far as saying that the "DMA API on top of IOMMU" is more intended to be for a system IOMMU for the bus in question, rather than a device-level IOMMU.
If an IOMMU is part of a device, then the device should handle it (maybe via an abstraction) and not via the DMA API. The DMA API should be handing the bus addresses to the device driver which the device's IOMMU would need to generate. (In other words, in this circumstance, the DMA API shouldn't give you the device internal address.)
Exactly. And the abstraction that people choose at the moment is the iommu API, for better or worse. It makes a lot of sense to use this API if the same iommu is used for other devices as well (which is the case on Tegra and probably a lot of others). Unfortunately the iommu API lacks support for cache management, and probably other things as well, because this was not an issue for the original use case (device assignment on KVM/x86).
This could be done by adding explicit or implied cache management to the IOMMU mapping interfaces, or by extending the dma-mapping interfaces in a way that covers the use case of the device managing its own address space, in addition to the existing coherent and streaming interfaces.
Don't we already have those in the DMA API? dma_sync_*() ?
dma_map_sg() - sets up the system MMU and deals with initial cache coherency handling. Device IOMMU being the responsibility of the GPU driver.
dma_sync_*() works with whatever comes out of dma_map_*(), true, but this is not what they want to do here.
The GPU can then do dma_sync_*() on the scatterlist as is necessary to synchronise the cache coherency (while respecting the ownership rules - which are very important on ARM to follow as some sync()s are destructive to any dirty data in the CPU cache.)
dma_unmap_sg() tears down the system MMU and deals with the final cache handling.
Why do we need more DMA API interfaces?
The dma_map_* interfaces assign the virtual addresses internally, using typically either a global address space for all devices, or one address space per device.
There are multiple things that this cannot do, and that is why the drivers use the iommu API directly:
- use one address space per 'struct mm' - map user memory with bus_address == user_address - map memory into the GPU without having a permanent kernel mapping - map memory first, and do the initial cache flushes later
Arnd
On Tue, Feb 03, 2015 at 04:31:13PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 15:22:05 Russell King - ARM Linux wrote:
Don't we already have those in the DMA API? dma_sync_*() ?
dma_map_sg() - sets up the system MMU and deals with initial cache coherency handling. Device IOMMU being the responsibility of the GPU driver.
dma_sync_*() works with whatever comes out of dma_map_*(), true, but this is not what they want to do here.
The GPU can then do dma_sync_*() on the scatterlist as is necessary to synchronise the cache coherency (while respecting the ownership rules - which are very important on ARM to follow as some sync()s are destructive to any dirty data in the CPU cache.)
dma_unmap_sg() tears down the system MMU and deals with the final cache handling.
Why do we need more DMA API interfaces?
The dma_map_* interfaces assign the virtual addresses internally, using typically either a global address space for all devices, or one address space per device.
We shouldn't be doing one address space per device for precisely this reason. We should be doing one address space per *bus*. I did have a nice diagram to illustrate the point in my previous email, but I deleted it, I wish I hadn't... briefly:
Fig. 1. +------------------+ |+-----+ device | CPU--L1cache--L2cache--Memory--SysMMU---<iobus>----IOMMU--> | |+-----+ | +------------------+
Fig.1 represents what I'd call the "GPU" issue that we're talking about in this thread.
Fig. 2. CPU--L1cache--L2cache--Memory--SysMMU---<iobus>--IOMMU--device
The DMA API should be responsible (at the very least) for everything on the left of "<iobus>" in and should be providing a dma_addr_t which is representative of what the device (in Fig.1) as a whole sees. That's the "system" part.
I believe this is the approach which is taken by x86 and similar platforms, simply because they tend not to have an IOMMU on individual devices (and if they did, eg, on a PCI card, it's clearly the responsibility of the device driver.)
Whether the DMA API also handles the IOMMU in Fig.1 or 2 is questionable. For fig.2, it is entirely possible that the same device could appear without an IOMMU, and in that scenario, you would want the IOMMU to be handled transparently.
However, by doing so for everything, you run into exactly the problem which is being discussed here - the need to separate out the cache coherency from the IOMMU aspects. You probably also have a setup very similar to fig.1 (which is certainly true of Vivante GPUs.)
If you have the need to separately control both, then using the DMA API to encapsulate both does not make sense - at which point, the DMA API should be responsible for the minimum only - in other words, everything to the left of <iobus> (so including the system MMU.) The control of the device IOMMU should be the responsibility of device driver in this case.
So, dma_map_sg() would be responsible for dealing with the CPU cache coherency issues, and setting up the system MMU. dma_sync_*() would be responsible for the CPU cache coherency issues, and dma_unmap_sg() would (again) deal with the CPU cache and tear down the system MMU mappings.
Meanwhile, the device driver has ultimate control over its IOMMU, the creation and destruction of mappings and context switches at the appropriate times.
On Tuesday 03 February 2015 15:54:04 Russell King - ARM Linux wrote:
On Tue, Feb 03, 2015 at 04:31:13PM +0100, Arnd Bergmann wrote:
The dma_map_* interfaces assign the virtual addresses internally, using typically either a global address space for all devices, or one address space per device.
We shouldn't be doing one address space per device for precisely this reason. We should be doing one address space per *bus*. I did have a nice diagram to illustrate the point in my previous email, but I deleted it, I wish I hadn't... briefly:
Fig. 1. +------------------+ |+-----+ device | CPU--L1cache--L2cache--Memory--SysMMU---<iobus>----IOMMU--> | |+-----+ | +------------------+
Fig.1 represents what I'd call the "GPU" issue that we're talking about in this thread.
Fig. 2. CPU--L1cache--L2cache--Memory--SysMMU---<iobus>--IOMMU--device
The DMA API should be responsible (at the very least) for everything on the left of "<iobus>" in and should be providing a dma_addr_t which is representative of what the device (in Fig.1) as a whole sees. That's the "system" part.
I believe this is the approach which is taken by x86 and similar platforms, simply because they tend not to have an IOMMU on individual devices (and if they did, eg, on a PCI card, it's clearly the responsibility of the device driver.)
Whether the DMA API also handles the IOMMU in Fig.1 or 2 is questionable. For fig.2, it is entirely possible that the same device could appear without an IOMMU, and in that scenario, you would want the IOMMU to be handled transparently.
However, by doing so for everything, you run into exactly the problem which is being discussed here - the need to separate out the cache coherency from the IOMMU aspects. You probably also have a setup very similar to fig.1 (which is certainly true of Vivante GPUs.)
If you have the need to separately control both, then using the DMA API to encapsulate both does not make sense - at which point, the DMA API should be responsible for the minimum only - in other words, everything to the left of <iobus> (so including the system MMU.) The control of the device IOMMU should be the responsibility of device driver in this case.
So, dma_map_sg() would be responsible for dealing with the CPU cache coherency issues, and setting up the system MMU. dma_sync_*() would be responsible for the CPU cache coherency issues, and dma_unmap_sg() would (again) deal with the CPU cache and tear down the system MMU mappings.
Meanwhile, the device driver has ultimate control over its IOMMU, the creation and destruction of mappings and context switches at the appropriate times.
I agree for the case you are describing here. From what I understood from Rob was that he is looking at something more like:
Fig 3 CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
where the IOMMU controls one or more contexts per device, and is shared across GPU and non-GPU devices. Here, we need to use the dmap-mapping interface to set up the IO page table for any device that is unable to address all of system RAM, and we can use it for purposes like isolation of the devices. There are also cases where using the IOMMU is not optional.
So unlike the scenario you describe, the driver cannot at the same time control the cache (using the dma-mapping API) and the I/O page tables (using the iommu API or some internal functions).
Arnd
On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann arnd@arndb.de wrote:
I agree for the case you are describing here. From what I understood from Rob was that he is looking at something more like:
Fig 3 CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
where the IOMMU controls one or more contexts per device, and is shared across GPU and non-GPU devices. Here, we need to use the dmap-mapping interface to set up the IO page table for any device that is unable to address all of system RAM, and we can use it for purposes like isolation of the devices. There are also cases where using the IOMMU is not optional.
Actually, just to clarify, the IOMMU instance is specific to the GPU.. not shared with other devices. Otherwise managing multiple contexts would go quite badly..
But other devices have their own instance of the same IOMMU.. so same driver could be used.
BR, -R
On Tuesday 03 February 2015 11:22:01 Rob Clark wrote:
On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann arnd@arndb.de wrote:
I agree for the case you are describing here. From what I understood from Rob was that he is looking at something more like:
Fig 3 CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
where the IOMMU controls one or more contexts per device, and is shared across GPU and non-GPU devices. Here, we need to use the dmap-mapping interface to set up the IO page table for any device that is unable to address all of system RAM, and we can use it for purposes like isolation of the devices. There are also cases where using the IOMMU is not optional.
Actually, just to clarify, the IOMMU instance is specific to the GPU.. not shared with other devices. Otherwise managing multiple contexts would go quite badly..
But other devices have their own instance of the same IOMMU.. so same driver could be used.
I think from the driver perspective, I'd view those two cases as identical. Not sure if Russell agrees with that.
Arnd
On Tue, Feb 03, 2015 at 05:36:59PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 11:22:01 Rob Clark wrote:
On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann arnd@arndb.de wrote:
I agree for the case you are describing here. From what I understood from Rob was that he is looking at something more like:
Fig 3 CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
where the IOMMU controls one or more contexts per device, and is shared across GPU and non-GPU devices. Here, we need to use the dmap-mapping interface to set up the IO page table for any device that is unable to address all of system RAM, and we can use it for purposes like isolation of the devices. There are also cases where using the IOMMU is not optional.
Actually, just to clarify, the IOMMU instance is specific to the GPU.. not shared with other devices. Otherwise managing multiple contexts would go quite badly..
But other devices have their own instance of the same IOMMU.. so same driver could be used.
I think from the driver perspective, I'd view those two cases as identical. Not sure if Russell agrees with that.
Imo whether the iommu is private to the device and required for gpu functionality like context switching or shared across a bunch of devices is fairly important. Assuming I understand this discussion correctly we have two different things pulling in opposite directions:
- From a gpu functionality perspective we want to give the gpu driver full control over the device-private iommu, pushing it out of the control of the dma api. dma_map_sg would just map to whatever bus addresses that iommu would need to use for generating access cycles.
This is the design used by every gpu driver we have in upstream thus far (where you always have some on-gpu iommu/pagetable walker thing), on top of whatever system iommu that might be there or not (which is then managed by the dma apis).
- On many soc people love to reuse iommus with the same or similar interface all over the place. The solution thus far adopted on arm platforms is to write an iommu driver for those and then implement the dma-api on top of this iommu.
But if we unconditionally do this then we rob the gpu driver's ability to control its private iommu like it wants to, because a lot of the functionality is lost behind the dma api abstraction.
Again assuming I'm not confused can't we just solve this by pushing the dma api abstraction down one layer for just the gpu, and let it use its private iommmu directly? Steps for binding a buffer would be: 1. dma_map_sg 2. Noodle the dma_addr_t out of the sg table and feed those into a 2nd level mapping set up through the iommu api for the gpu-private mmu.
Again, this is what i915 and all the ttm based drivers already do, except that we don't use the generic iommu interfaces but have our own (i915 has its interface in i915_gem_gtt.c, ttm just calls them tt for translation tables ...).
Cheers, Daniel
On Tuesday 03 February 2015 21:04:35 Daniel Vetter wrote:
On many soc people love to reuse iommus with the same or similar interface all over the place. The solution thus far adopted on arm platforms is to write an iommu driver for those and then implement the dma-api on top of this iommu.
But if we unconditionally do this then we rob the gpu driver's ability to control its private iommu like it wants to, because a lot of the functionality is lost behind the dma api abstraction.
I believe in case of rockchips, msm, exynos and tegra, the same iommu is used directly by the DRM driver while it is used implicitly by the dma-mapping API. We have run into some problems with this in 3.19, but they should be solvable.
Again assuming I'm not confused can't we just solve this by pushing the dma api abstraction down one layer for just the gpu, and let it use its private iommmu directly? Steps for binding a buffer would be:
- dma_map_sg
- Noodle the dma_addr_t out of the sg table and feed those into a 2nd
level mapping set up through the iommu api for the gpu-private mmu.
If you want to do that, you run into the problem of telling the driver core about it. We associate the device with an iommu in the device tree, describing there how it is wired up.
The driver core creates a platform_device for this and checks if it an iommu mapping is required or wanted for the device, which is then set up. When the device driver wants to create its own iommu mapping, this conflicts with the one that is already there. We can't just skip the iommu setup for all devices because it may be needed sometimes, and I don't really want to see hacks where the driver core knows which devices are GPUs and skips the mapping for them, which would be a layering violation.
Again, this is what i915 and all the ttm based drivers already do, except that we don't use the generic iommu interfaces but have our own (i915 has its interface in i915_gem_gtt.c, ttm just calls them tt for translation tables ...).
Right, if you have a private iommu, there is no problem. The tricky part is using a single driver for the system-level translation and the gpu private mappings when there is only one type of iommu in the system.
Arnd
On Tue, Feb 3, 2015 at 10:42 PM, Arnd Bergmann arnd@arndb.de wrote:
Again assuming I'm not confused can't we just solve this by pushing the dma api abstraction down one layer for just the gpu, and let it use its private iommmu directly? Steps for binding a buffer would be:
- dma_map_sg
- Noodle the dma_addr_t out of the sg table and feed those into a 2nd
level mapping set up through the iommu api for the gpu-private mmu.
If you want to do that, you run into the problem of telling the driver core about it. We associate the device with an iommu in the device tree, describing there how it is wired up.
The driver core creates a platform_device for this and checks if it an iommu mapping is required or wanted for the device, which is then set up. When the device driver wants to create its own iommu mapping, this conflicts with the one that is already there. We can't just skip the iommu setup for all devices because it may be needed sometimes, and I don't really want to see hacks where the driver core knows which devices are GPUs and skips the mapping for them, which would be a layering violation.
I don't think you get a choice but to make gpus a special case. There's a bunch of cases why the iommu private to the gpu is special: - If there's gpu-private iommu at all you have a nice security problem, and you must scan your cmd stream to make sure no gpu access goes to arbitrary system memory. We kinda consider isolation between clients optional, but isolation to everything else is mandatory. And scanning the cmd stream in software has such big implications on the design of your driver that you essentially need 2 different drivers. Even if the IP block otherwise matches. - If your iommu supports multiple address space then the gpu must know. We've already covered this case.
So trying to wedge the dma api between the gpu and its private iommu is imo the layering violation here. Imo the dma api only should control an iommu for the gpu if: - the iommu is shared (so can't be used for isolation and you need the full blwon cmd scanner) - it's a 2nd level iommu (e.g. what we have on i915) and there is another private iommu.
Note that with private I only mean no other device can use it, I don't mean whether it's on the same IP block or not (we even have an iommu abstraction in i915 because the pagetable walkers are pretty much separate from everything else and evolve mostly independently). -Daniel
On Tue, Feb 03, 2015 at 10:42:26PM +0100, Arnd Bergmann wrote:
Right, if you have a private iommu, there is no problem. The tricky part is using a single driver for the system-level translation and the gpu private mappings when there is only one type of iommu in the system.
You've got a problem anyway with this approach. If you look at my figure 2 and apply it to this scenario, you have two MMUs stacked on top of each other. That's something that (afaik) we don't support, but it's entirely possible that will come along with ARM64.
It may not be nice to have to treat GPUs specially, but I think we really do need to, and forget the idea that the GPU's IOMMU (as opposed to a system MMU) should appear in a generic form in DT.
On Tue, Feb 03, 2015 at 11:22:01AM -0500, Rob Clark wrote:
On Tue, Feb 3, 2015 at 11:12 AM, Arnd Bergmann arnd@arndb.de wrote:
I agree for the case you are describing here. From what I understood from Rob was that he is looking at something more like:
Fig 3 CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
where the IOMMU controls one or more contexts per device, and is shared across GPU and non-GPU devices. Here, we need to use the dmap-mapping interface to set up the IO page table for any device that is unable to address all of system RAM, and we can use it for purposes like isolation of the devices. There are also cases where using the IOMMU is not optional.
Actually, just to clarify, the IOMMU instance is specific to the GPU.. not shared with other devices. Otherwise managing multiple contexts would go quite badly..
But other devices have their own instance of the same IOMMU.. so same driver could be used.
Okay, so that is my Fig.2 case, and we don't have to worry about Fig.3.
One thing I forgot in Fig.1/2 which my original did have were to mark the system MMU as optional. (Think an ARM64 with SMMU into a 32-bit peripheral bus.) Do we support stacked MMUs in the DMA API? We may need to if we keep IOMMUs in the DMA API.
On Tue, Feb 03, 2015 at 05:12:40PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 15:54:04 Russell King - ARM Linux wrote:
On Tue, Feb 03, 2015 at 04:31:13PM +0100, Arnd Bergmann wrote:
The dma_map_* interfaces assign the virtual addresses internally, using typically either a global address space for all devices, or one address space per device.
We shouldn't be doing one address space per device for precisely this reason. We should be doing one address space per *bus*. I did have a nice diagram to illustrate the point in my previous email, but I deleted it, I wish I hadn't... briefly:
Fig. 1. +------------------+ |+-----+ device | CPU--L1cache--L2cache--Memory--SysMMU---<iobus>----IOMMU--> | |+-----+ | +------------------+
Fig.1 represents what I'd call the "GPU" issue that we're talking about in this thread.
Fig. 2. CPU--L1cache--L2cache--Memory--SysMMU---<iobus>--IOMMU--device
The DMA API should be responsible (at the very least) for everything on the left of "<iobus>" in and should be providing a dma_addr_t which is representative of what the device (in Fig.1) as a whole sees. That's the "system" part.
I believe this is the approach which is taken by x86 and similar platforms, simply because they tend not to have an IOMMU on individual devices (and if they did, eg, on a PCI card, it's clearly the responsibility of the device driver.)
Whether the DMA API also handles the IOMMU in Fig.1 or 2 is questionable. For fig.2, it is entirely possible that the same device could appear without an IOMMU, and in that scenario, you would want the IOMMU to be handled transparently.
However, by doing so for everything, you run into exactly the problem which is being discussed here - the need to separate out the cache coherency from the IOMMU aspects. You probably also have a setup very similar to fig.1 (which is certainly true of Vivante GPUs.)
If you have the need to separately control both, then using the DMA API to encapsulate both does not make sense - at which point, the DMA API should be responsible for the minimum only - in other words, everything to the left of <iobus> (so including the system MMU.) The control of the device IOMMU should be the responsibility of device driver in this case.
So, dma_map_sg() would be responsible for dealing with the CPU cache coherency issues, and setting up the system MMU. dma_sync_*() would be responsible for the CPU cache coherency issues, and dma_unmap_sg() would (again) deal with the CPU cache and tear down the system MMU mappings.
Meanwhile, the device driver has ultimate control over its IOMMU, the creation and destruction of mappings and context switches at the appropriate times.
I agree for the case you are describing here. From what I understood from Rob was that he is looking at something more like:
Fig 3 CPU--L1cache--L2cache--Memory--IOMMU---<iobus>--device
where the IOMMU controls one or more contexts per device, and is shared across GPU and non-GPU devices. Here, we need to use the dmap-mapping interface to set up the IO page table for any device that is unable to address all of system RAM, and we can use it for purposes like isolation of the devices. There are also cases where using the IOMMU is not optional.
Okay, but switching contexts is not something which the DMA API has any knowledge of (so it can't know which context to associate with which mapping.) While it knows which device, it has no knowledge (nor is there any way for it to gain knowledge) about contexts.
My personal view is that extending the DMA API in this way feels quite dirty - it's a violation of the DMA API design, which is to (a) demark the buffer ownership between CPU and DMA agent, and (b) to translate buffer locations into a cookie which device drivers can use to instruct their device to access that memory. To see why, consider... that you map a buffer to a device in context A, and then you switch to context B, which means the dma_addr_t given previously is no longer valid. You then try to unmap it... which is normally done using the (now no longer valid) dma_addr_t.
It seems to me that to support this at DMA API level, we would need to completely revamp the DMA API, which IMHO isn't going to be nice. (It would mean that we end up with three APIs - the original PCI DMA API, the existing DMA API, and some new DMA API.)
Do we have any views on how common this feature is?
On Tue, Feb 3, 2015 at 11:58 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
Okay, but switching contexts is not something which the DMA API has any knowledge of (so it can't know which context to associate with which mapping.) While it knows which device, it has no knowledge (nor is there any way for it to gain knowledge) about contexts.
My personal view is that extending the DMA API in this way feels quite dirty - it's a violation of the DMA API design, which is to (a) demark the buffer ownership between CPU and DMA agent, and (b) to translate buffer locations into a cookie which device drivers can use to instruct their device to access that memory. To see why, consider... that you map a buffer to a device in context A, and then you switch to context B, which means the dma_addr_t given previously is no longer valid. You then try to unmap it... which is normally done using the (now no longer valid) dma_addr_t.
It seems to me that to support this at DMA API level, we would need to completely revamp the DMA API, which IMHO isn't going to be nice. (It would mean that we end up with three APIs - the original PCI DMA API, the existing DMA API, and some new DMA API.)
Do we have any views on how common this feature is?
I can't think of cases outside of GPU's.. if it were more common I'd be in favor of teaching dma api about multiple contexts, but right now I think that would just amount to forcing a lot of churn on everyone else for the benefit of GPU's.
IMHO it makes more sense for GPU drivers to bypass the dma api if they need to. Plus, sooner or later, someone will discover that with some trick or optimization they can get moar fps, but the extra layer of abstraction will just be getting in the way.
BR, -R
On Tue, Feb 03, 2015 at 12:35:34PM -0500, Rob Clark wrote:
On Tue, Feb 3, 2015 at 11:58 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
Okay, but switching contexts is not something which the DMA API has any knowledge of (so it can't know which context to associate with which mapping.) While it knows which device, it has no knowledge (nor is there any way for it to gain knowledge) about contexts.
My personal view is that extending the DMA API in this way feels quite dirty - it's a violation of the DMA API design, which is to (a) demark the buffer ownership between CPU and DMA agent, and (b) to translate buffer locations into a cookie which device drivers can use to instruct their device to access that memory. To see why, consider... that you map a buffer to a device in context A, and then you switch to context B, which means the dma_addr_t given previously is no longer valid. You then try to unmap it... which is normally done using the (now no longer valid) dma_addr_t.
It seems to me that to support this at DMA API level, we would need to completely revamp the DMA API, which IMHO isn't going to be nice. (It would mean that we end up with three APIs - the original PCI DMA API, the existing DMA API, and some new DMA API.)
Do we have any views on how common this feature is?
I can't think of cases outside of GPU's.. if it were more common I'd be in favor of teaching dma api about multiple contexts, but right now I think that would just amount to forcing a lot of churn on everyone else for the benefit of GPU's.
IMHO it makes more sense for GPU drivers to bypass the dma api if they need to. Plus, sooner or later, someone will discover that with some trick or optimization they can get moar fps, but the extra layer of abstraction will just be getting in the way.
See my other reply, but all existing full-blown drivers don't bypass the dma api. Instead it's just a two-level scheme: 1. First level is dma api. Might or might not contain a system iommu. 2. 2nd level is the gpu-private iommu which is also used for per context address spaces. Thus far all drivers just rolled their own drivers for this (it's kinda fused to the chips on x86 hw anyway), but it looks like using the iommu api gives us a somewhat suitable abstraction for code sharing.
Imo you need both, otherwise we start leaking stuff like cpu cache flushing all over the place. Looking at i915 (where the dma api assumes that everything is coherent, which is kinda not the case) that won't be pretty. And there's still the issue that you might nest a system iommu and a 2nd level iommu for per-context pagetables (this is real and what's going on right now on intel hw). -Daniel
On Tuesday 03 February 2015 12:35:34 Rob Clark wrote:
I can't think of cases outside of GPU's.. if it were more common I'd be in favor of teaching dma api about multiple contexts, but right now I think that would just amount to forcing a lot of churn on everyone else for the benefit of GPU's.
We have a couple of users of the iommu API at the moment outside of GPUs:
* drivers/media/platform/omap3isp/isp.c * drivers/remoteproc/remoteproc_core.c * drivers/infiniband/hw/usnic/usnic_uiom.c * drivers/vfio/
I assume we will see a few more over time. The vfio case is the most important one here, since that is what the iommu API was designed for.
Arnd
On Tue, Feb 3, 2015 at 9:52 AM, Arnd Bergmann arnd@arndb.de wrote:
On Tuesday 03 February 2015 14:41:09 Russell King - ARM Linux wrote:
On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
I think what you see here is a quite common hardware setup and we really lack the right abstraction for it at the moment. Everybody seems to work around it with a mix of the dma-mapping API and the iommu API. These are doing different things, and even though the dma-mapping API can be implemented on top of the iommu API, they are not really compatible.
I'd go as far as saying that the "DMA API on top of IOMMU" is more intended to be for a system IOMMU for the bus in question, rather than a device-level IOMMU.
If an IOMMU is part of a device, then the device should handle it (maybe via an abstraction) and not via the DMA API. The DMA API should be handing the bus addresses to the device driver which the device's IOMMU would need to generate. (In other words, in this circumstance, the DMA API shouldn't give you the device internal address.)
Exactly. And the abstraction that people choose at the moment is the iommu API, for better or worse. It makes a lot of sense to use this API if the same iommu is used for other devices as well (which is the case on Tegra and probably a lot of others). Unfortunately the iommu API lacks support for cache management, and probably other things as well, because this was not an issue for the original use case (device assignment on KVM/x86).
This could be done by adding explicit or implied cache management to the IOMMU mapping interfaces, or by extending the dma-mapping interfaces in a way that covers the use case of the device managing its own address space, in addition to the existing coherent and streaming interfaces.
I think for gpu's, we'd prefer explicit and less abstraction.. which is probably opposite of what every other driver would want
In the end, my eventual goal is explicit control of tlb flush, and control of my address space. And in fact in some cases we are going to want to use the gpu to bang on iommu registers to do context switches and tlb flushes. (Which is obviously not the first step.. and something that is fairly difficult to get right/secure.. but the performance win seems significant so I'm not sure we can avoid it.)
BR, -R
Arnd
On Tue, Feb 3, 2015 at 9:41 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Tue, Feb 03, 2015 at 03:17:27PM +0100, Arnd Bergmann wrote:
On Tuesday 03 February 2015 09:04:03 Rob Clark wrote:
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
I think what you see here is a quite common hardware setup and we really lack the right abstraction for it at the moment. Everybody seems to work around it with a mix of the dma-mapping API and the iommu API. These are doing different things, and even though the dma-mapping API can be implemented on top of the iommu API, they are not really compatible.
I'd go as far as saying that the "DMA API on top of IOMMU" is more intended to be for a system IOMMU for the bus in question, rather than a device-level IOMMU.
If an IOMMU is part of a device, then the device should handle it (maybe via an abstraction) and not via the DMA API. The DMA API should be handing the bus addresses to the device driver which the device's IOMMU would need to generate. (In other words, in this circumstance, the DMA API shouldn't give you the device internal address.)
if the dma_addr_t becomes the address upstream of the iommu (in practice, the phys addr), that would, I think, address my concerns about dma_addr_t
BR, -R
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Tue, Feb 03, 2015 at 09:04:03AM -0500, Rob Clark wrote:
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
So how do you intend to import from a subsystem which only gives you the dma_addr_t?
If you aren't passing system memory, you have no struct page. You can't fake up a struct page. What this means is that struct scatterlist can't represent it any other way.
On Tue, Feb 3, 2015 at 9:37 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Tue, Feb 03, 2015 at 09:04:03AM -0500, Rob Clark wrote:
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
So how do you intend to import from a subsystem which only gives you the dma_addr_t?
If you aren't passing system memory, you have no struct page. You can't fake up a struct page. What this means is that struct scatterlist can't represent it any other way.
Tell the exporter to stop using carveouts, and give me proper memory instead.. ;-)
Well, at least on these SoC's, I think the only valid use for carveout memory is the bootloader splashscreen. And I was planning on just hanging on to that for myself for fbdev scanout buffer or other internal (non shared) usage..
BR, -R
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Tue, Feb 03, 2015 at 09:44:57AM -0500, Rob Clark wrote:
On Tue, Feb 3, 2015 at 9:37 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Tue, Feb 03, 2015 at 09:04:03AM -0500, Rob Clark wrote:
Since I'm stuck w/ an iommu, instead of built in mmu, my plan was to drop use of dma-mapping entirely (incl the current call to dma_map_sg, which I just need until we can use drm_cflush on arm), and attach/detach iommu domains directly to implement context switches. At that point, dma_addr_t really has no sensible meaning for me.
So how do you intend to import from a subsystem which only gives you the dma_addr_t?
If you aren't passing system memory, you have no struct page. You can't fake up a struct page. What this means is that struct scatterlist can't represent it any other way.
Tell the exporter to stop using carveouts, and give me proper memory instead.. ;-)
Well, at least on these SoC's, I think the only valid use for carveout memory is the bootloader splashscreen. And I was planning on just hanging on to that for myself for fbdev scanout buffer or other internal (non shared) usage..
I wasn't thinking about carveouts - as I already mentioned earlier in this thread, it may be memory which couldn't possibly ever be system memory - for example, a separate chunk of memory which is tightly coupled to the graphics system but not so to the CPU.
In such a case, we wouldn't want to use that as normal system memory, but we would want to allocate framebuffers and the like from it, and maybe pass them around.
While it may not be appropriate for MSM, it's still something that needs to be considered, because there may be (and I know there are) dmabuf users which do pass memory this way.
So, what I'm saying is that for the purposes of the dmabuf API, we can't mandate that the scatterlists will contain a valid struct page pointer. It'd probably be a good idea for the importer to validate the scatterlist at import time if it has this requirement.
However, thinking about this more, I think that from a generic design point of view, we really should limit the "struct page" usage to a special MSM-ism - something which should definitely not be copied by other drivers. As has been mentioned previously, if there is a system MMU which needs to be programmed to map system memory onto the bus, the struct page becomes absolutely useless, and the only thing that gives you the correct "handle" to that memory is the dma_addr_t.
Finally, note that n_mapped = dma_map_sg(dev, sg, n_ent, dir) - n_mapped can be less than n_ent when there's the presence of an IOMMU, since an IOMMU is permitted to coalesce individual scatterlist entries if it so chooses, and when walking the scatterlist for DMA purposes, the scatterlist sg_dma_*() accessors should be used, and it should be iterated over from 0 to n_mapped, not 0 to n_ent. It's important to realise that in driver code, sg->length may not be the same as sg_dma_len(sg) for exactly this reason:
#ifdef CONFIG_NEED_SG_DMA_LENGTH #define sg_dma_len(sg) ((sg)->dma_length) #else #define sg_dma_len(sg) ((sg)->length) #endif
Hi Russell,
On 30 January 2015 at 00:56, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Thu, Jan 29, 2015 at 01:52:09PM -0500, Rob Clark wrote:
Quite possibly for some of these edge some of cases, some of the dma-buf exporters are going to need to get more clever (ie. hand off different scatterlists to different clients). Although I think by far the two common cases will be "I can support anything via an iommu/mmu" and "I need phys contig".
But that isn't an issue w/ dma-buf itself, so much as it is an issue w/ drivers. I guess there would be more interest in fixing up drivers when actual hw comes along that needs it..
However, validating the attachments is the business of dma-buf. This is actual infrastructure, which should ensure some kind of sanity such as the issues I've raised.
The whole "we can push it onto our users" is really on - what that results in is the users ignoring most of the requirements and just doing their own thing, which ultimately ends up with the whole thing turning into a disgusting mess - one which becomes very difficult to fix later.
Now, if we're going to do the "more clever" thing you mention above, that rather negates the point of this two-part patch set, which is to provide the union of the DMA capabilities of all users. A union in that case is no longer sane as we'd be tailoring the SG lists to each user.
If we aren't going to do the "more clever" thing, then yes, we need this code to calculate that union, but we _also_ need it to do sanity checking right from the start, and refuse conditions which ultimately break the ability to make use of that union - in other words, when the union of the DMA capabilities means that the dmabuf can't be represented.
Unless we do that, we'll just end up with random drivers interpreting what they want from the DMA capabilities, and we'll have some drivers exporting (eg) scatterlists which satisfy the maximum byte size of an element, but ignoring the maximum number of entries or vice versa, and that'll most probably hide the case of "too small a union".
I agree, and I'll add the check for max_segment_size * max_segment_count < dmabuf->size and resend; will that be alright with you?
It really doesn't make sense to do both either: that route is even more madness, because we'll end up with two classes of drivers - those which use the union approach, and those which don't.
The KISS principle applies here.
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
Best regards, ~Sumit.
Hello,
On 2015-01-27 09:25, Sumit Semwal wrote:
Add some helpers to share the constraints of devices while attaching to the dmabuf buffer.
At each attach, the constraints are calculated based on the following:
- max_segment_size, max_segment_count, segment_boundary_mask from device_dma_parameters.
In case the attaching device's constraints don't match up, attach() fails.
At detach, the constraints are recalculated based on the remaining attached devices.
Two helpers are added:
- dma_buf_get_constraints - which gives the current constraints as calculated during each attach on the buffer till the time,
- dma_buf_recalc_constraints - which recalculates the constraints for all currently attached devices for the 'paranoid' ones amongst us.
The idea of this patch is largely taken from Rob Clark's RFC at https://lkml.org/lkml/2012/7/19/285, and the comments received on it.
Cc: Rob Clark robdclark@gmail.com Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
The code looks okay, although it will probably will work well only with typical cases like 'contiguous memory needed' or 'no constraints at all' (iommu).
Acked-by: Marek Szyprowski m.szyprowski@samsung.com
v3:
- Thanks to Russell's comment, remove dma_mask and coherent_dma_mask from constraints' calculation; has a nice side effect of letting us use device_dma_parameters directly to list constraints.
- update the debugfs output to show constraint info as well.
v2: split constraints-sharing and allocation helpers
drivers/dma-buf/dma-buf.c | 126 +++++++++++++++++++++++++++++++++++++++++++++- include/linux/dma-buf.h | 7 +++ 2 files changed, 132 insertions(+), 1 deletion(-)
diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c index 5be225c2ba98..f363f1440803 100644 --- a/drivers/dma-buf/dma-buf.c +++ b/drivers/dma-buf/dma-buf.c @@ -264,6 +264,66 @@ static inline int is_dma_buf_file(struct file *file) return file->f_op == &dma_buf_fops; } +static inline void init_constraints(struct device_dma_parameters *cons) +{
- cons->max_segment_count = (unsigned int)-1;
- cons->max_segment_size = (unsigned int)-1;
- cons->segment_boundary_mask = (unsigned long)-1;
+}
+/*
- calc_constraints - calculates if the new attaching device's constraints
- match, with the constraints of already attached devices; if yes, returns
- the constraints; else return ERR_PTR(-EINVAL)
- */
+static int calc_constraints(struct device *dev,
struct device_dma_parameters *calc_cons)
+{
- struct device_dma_parameters cons = *calc_cons;
- cons.max_segment_count = min(cons.max_segment_count,
dma_get_max_seg_count(dev));
- cons.max_segment_size = min(cons.max_segment_size,
dma_get_max_seg_size(dev));
- cons.segment_boundary_mask &= dma_get_seg_boundary(dev);
- if (!cons.max_segment_count ||
!cons.max_segment_size ||
!cons.segment_boundary_mask) {
pr_err("Dev: %s's constraints don't match\n", dev_name(dev));
return -EINVAL;
- }
- *calc_cons = cons;
- return 0;
+}
+/*
- recalc_constraints - recalculates constraints for all attached devices;
- useful for detach() recalculation, and for dma_buf_recalc_constraints()
- helper.
- Returns recalculated constraints in recalc_cons, or error in the unlikely
- case when constraints of attached devices might have changed.
- */
+static int recalc_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *recalc_cons)
+{
- struct device_dma_parameters calc_cons;
- struct dma_buf_attachment *attach;
- int ret = 0;
- init_constraints(&calc_cons);
- list_for_each_entry(attach, &dmabuf->attachments, node) {
ret = calc_constraints(attach->dev, &calc_cons);
if (ret)
return ret;
- }
- *recalc_cons = calc_cons;
- return 0;
+}
- /**
- dma_buf_export_named - Creates a new dma_buf, and associates an anon file
- with this buffer, so it can be exported.
@@ -313,6 +373,9 @@ struct dma_buf *dma_buf_export_named(void *priv, const struct dma_buf_ops *ops, dmabuf->ops = ops; dmabuf->size = size; dmabuf->exp_name = exp_name;
- init_constraints(&dmabuf->constraints);
- init_waitqueue_head(&dmabuf->poll); dmabuf->cb_excl.poll = dmabuf->cb_shared.poll = &dmabuf->poll; dmabuf->cb_excl.active = dmabuf->cb_shared.active = 0;
@@ -422,7 +485,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, struct device *dev) { struct dma_buf_attachment *attach;
- int ret;
- int ret = 0;
if (WARN_ON(!dmabuf || !dev)) return ERR_PTR(-EINVAL); @@ -436,6 +499,9 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, mutex_lock(&dmabuf->lock);
- if (calc_constraints(dev, &dmabuf->constraints))
goto err_constraints;
- if (dmabuf->ops->attach) { ret = dmabuf->ops->attach(dmabuf, dev, attach); if (ret)
@@ -448,6 +514,7 @@ struct dma_buf_attachment *dma_buf_attach(struct dma_buf *dmabuf, err_attach: kfree(attach); +err_constraints: mutex_unlock(&dmabuf->lock); return ERR_PTR(ret); } @@ -470,6 +537,8 @@ void dma_buf_detach(struct dma_buf *dmabuf, struct dma_buf_attachment *attach) if (dmabuf->ops->detach) dmabuf->ops->detach(dmabuf, attach);
- recalc_constraints(dmabuf, &dmabuf->constraints);
- mutex_unlock(&dmabuf->lock); kfree(attach); }
@@ -770,6 +839,56 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, void *vaddr) } EXPORT_SYMBOL_GPL(dma_buf_vunmap); +/**
- dma_buf_get_constraints - get the *current* constraints of the dmabuf,
- as calculated during each attach(); returns error on invalid inputs
- @dmabuf: [in] buffer to get constraints of
- @constraints: [out] current constraints are returned in this
- */
+int dma_buf_get_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *constraints)
+{
- if (WARN_ON(!dmabuf || !constraints))
return -EINVAL;
- mutex_lock(&dmabuf->lock);
- *constraints = dmabuf->constraints;
- mutex_unlock(&dmabuf->lock);
- return 0;
+} +EXPORT_SYMBOL_GPL(dma_buf_get_constraints);
+/**
- dma_buf_recalc_constraints - *recalculate* the constraints for the buffer
- afresh, from the list of currently attached devices; this could be useful
- cross-check the current constraints, for exporters that might want to be
- 'paranoid' about the device constraints.
- returns error on invalid inputs
- @dmabuf: [in] buffer to get constraints of
- @constraints: [out] recalculated constraints are returned in this
- */
+int dma_buf_recalc_constraints(struct dma_buf *dmabuf,
struct device_dma_parameters *constraints)
+{
- struct device_dma_parameters calc_cons;
- int ret = 0;
- if (WARN_ON(!dmabuf || !constraints))
return -EINVAL;
- mutex_lock(&dmabuf->lock);
- ret = recalc_constraints(dmabuf, &calc_cons);
- if (!ret)
*constraints = calc_cons;
- mutex_unlock(&dmabuf->lock);
- return ret;
+} +EXPORT_SYMBOL_GPL(dma_buf_recalc_constraints);
- #ifdef CONFIG_DEBUG_FS static int dma_buf_describe(struct seq_file *s) {
@@ -801,6 +920,11 @@ static int dma_buf_describe(struct seq_file *s) buf_obj->file->f_flags, buf_obj->file->f_mode, file_count(buf_obj->file), buf_obj->exp_name);
seq_printf(s, "\tConstraints: Seg Count: %08u, Seg Size: %08u",
buf_obj->constraints.max_segment_count,
buf_obj->constraints.max_segment_size);
seq_printf(s, " seg boundary mask: %08lx\n",
buf_obj->constraints.segment_boundary_mask);
seq_puts(s, "\tAttached Devices:\n"); attach_count = 0; diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index 694e1fe1c4b4..489ad9b2e5ae 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -34,6 +34,7 @@ #include <linux/wait.h> struct device; +struct device_dma_parameters; struct dma_buf; struct dma_buf_attachment; @@ -116,6 +117,7 @@ struct dma_buf_ops {
- @ops: dma_buf_ops associated with this buffer object.
- @exp_name: name of the exporter; useful for debugging.
- @list_node: node for dma_buf accounting and debugging.
*/
- @constraints: calculated constraints of attached devices.
- @priv: exporter specific private data for this buffer object.
- @resv: reservation object linked to this dma-buf
@@ -130,6 +132,7 @@ struct dma_buf { void *vmap_ptr; const char *exp_name; struct list_head list_node;
- struct device_dma_parameters constraints; void *priv; struct reservation_object *resv;
@@ -211,4 +214,8 @@ void *dma_buf_vmap(struct dma_buf *); void dma_buf_vunmap(struct dma_buf *, void *vaddr); int dma_buf_debugfs_create_file(const char *name, int (*write)(struct seq_file *));
+int dma_buf_get_constraints(struct dma_buf *, struct device_dma_parameters *); +int dma_buf_recalc_constraints(struct dma_buf *,
#endif /* __DMA_BUF_H__ */struct device_dma_parameters *);
Best regards
On Wed, Feb 11, 2015 at 09:28:37AM +0100, Marek Szyprowski wrote:
Hello,
On 2015-01-27 09:25, Sumit Semwal wrote:
Add some helpers to share the constraints of devices while attaching to the dmabuf buffer.
At each attach, the constraints are calculated based on the following:
- max_segment_size, max_segment_count, segment_boundary_mask from device_dma_parameters.
In case the attaching device's constraints don't match up, attach() fails.
At detach, the constraints are recalculated based on the remaining attached devices.
Two helpers are added:
- dma_buf_get_constraints - which gives the current constraints as calculated during each attach on the buffer till the time,
- dma_buf_recalc_constraints - which recalculates the constraints for all currently attached devices for the 'paranoid' ones amongst us.
The idea of this patch is largely taken from Rob Clark's RFC at https://lkml.org/lkml/2012/7/19/285, and the comments received on it.
Cc: Rob Clark robdclark@gmail.com Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
The code looks okay, although it will probably will work well only with typical cases like 'contiguous memory needed' or 'no constraints at all' (iommu).
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
On Wed, Feb 11, 2015 at 6:12 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Wed, Feb 11, 2015 at 09:28:37AM +0100, Marek Szyprowski wrote:
Hello,
On 2015-01-27 09:25, Sumit Semwal wrote:
Add some helpers to share the constraints of devices while attaching to the dmabuf buffer.
At each attach, the constraints are calculated based on the following:
- max_segment_size, max_segment_count, segment_boundary_mask from device_dma_parameters.
In case the attaching device's constraints don't match up, attach() fails.
At detach, the constraints are recalculated based on the remaining attached devices.
Two helpers are added:
- dma_buf_get_constraints - which gives the current constraints as calculated during each attach on the buffer till the time,
- dma_buf_recalc_constraints - which recalculates the constraints for all currently attached devices for the 'paranoid' ones amongst us.
The idea of this patch is largely taken from Rob Clark's RFC at https://lkml.org/lkml/2012/7/19/285, and the comments received on it.
Cc: Rob Clark robdclark@gmail.com Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
The code looks okay, although it will probably will work well only with typical cases like 'contiguous memory needed' or 'no constraints at all' (iommu).
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
jfyi, I agree with that.. I think the flag is probably the right approach to start with. At the end of the day it *is* still just an in-kernel API (and not something that ends up as userspace ABI) so when we come up with the use case to make it more generic we can. Vs. making it look like something more generic when it isn't really yet.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
to be completely generic, we would really need a way that the device could take over only just the last iommu (in case there were multiple levels of address translation)..
I'm not completely sure, but I *think* the other arm gpu's have their own internal mmu for doing context switching, etc, so if there is an additional iommu in front of them they may actually still want to use the normal dma api's. Someone please contradict me if I am wrong. If this ends up being an issue only for msm, then I'm completely ok with the easier option of a less generic solution..
BR, -R
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
On Wed, Feb 11, 2015 at 06:23:52AM -0500, Rob Clark wrote:
On Wed, Feb 11, 2015 at 6:12 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
to be completely generic, we would really need a way that the device could take over only just the last iommu (in case there were multiple levels of address translation)..
I still hold that if the dma api steals the iommu your gpu needs for context switching then that's a bug in the platform setup code. dma api really doesn't have any concept of switchable hw contexts. So trying to work around this brokeness by mandating it as a valid dma-buf use-case is totally backwards. -Daniel
On Wed, Feb 11, 2015 at 7:56 AM, Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Feb 11, 2015 at 06:23:52AM -0500, Rob Clark wrote:
On Wed, Feb 11, 2015 at 6:12 AM, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
to be completely generic, we would really need a way that the device could take over only just the last iommu (in case there were multiple levels of address translation)..
I still hold that if the dma api steals the iommu your gpu needs for context switching then that's a bug in the platform setup code. dma api really doesn't have any concept of switchable hw contexts. So trying to work around this brokeness by mandating it as a valid dma-buf use-case is totally backwards.
sure, my only point is that if I'm the odd man out, I can live with a hack (ie. requiring drm/msm to be aware enough of the platform to know if there is >1 level of address translation and frob'ing my 'struct device' accordingly)... no point in a generic solution for one user. I like to be practical.
BR, -R
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
Hello,
On 2015-02-11 12:12, Russell King - ARM Linux wrote:
On Wed, Feb 11, 2015 at 09:28:37AM +0100, Marek Szyprowski wrote:
On 2015-01-27 09:25, Sumit Semwal wrote:
Add some helpers to share the constraints of devices while attaching to the dmabuf buffer.
At each attach, the constraints are calculated based on the following:
- max_segment_size, max_segment_count, segment_boundary_mask from device_dma_parameters.
In case the attaching device's constraints don't match up, attach() fails.
At detach, the constraints are recalculated based on the remaining attached devices.
Two helpers are added:
- dma_buf_get_constraints - which gives the current constraints as calculated during each attach on the buffer till the time,
- dma_buf_recalc_constraints - which recalculates the constraints for all currently attached devices for the 'paranoid' ones amongst us.
The idea of this patch is largely taken from Rob Clark's RFC at https://lkml.org/lkml/2012/7/19/285, and the comments received on it.
Cc: Rob Clark robdclark@gmail.com Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
The code looks okay, although it will probably will work well only with typical cases like 'contiguous memory needed' or 'no constraints at all' (iommu).
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
Okay, but flag-based approach also have limitations.
Frankly, if we want to make it really portable and sharable between devices, then IMO we should get rid of struct scatterlist and replace it with simple array of pfns in dma_buf. This way at least the problem of missing struct page will be solved and the buffer representation will be also a bit more compact.
Such solution however also requires extending dma-mapping API to handle mapping and unmapping of such pfn arrays. The advantage of this approach is the fact that it will be completely new API, so it can be designed well from the beginning.
Best regards
On Wed, Feb 11, 2015 at 01:20:24PM +0100, Marek Szyprowski wrote:
Hello,
On 2015-02-11 12:12, Russell King - ARM Linux wrote:
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
Okay, but flag-based approach also have limitations.
Yes, the flag-based approach doesn't let you describe in detail what the importer can accept - which, given the issues that I've raised is a *good* thing. We won't be misleading anyone into thinking that we can do something that's really half-baked, and which we have no present requirement for.
This is precisely what Linus talks about when he says "don't over- engineer" - if we over-engineer this, we end up with something that sort-of works, and that's a bad thing.
The Keep It Simple approach here makes total sense - what are our current requirements - to be able to say that an importer can only accept: - contiguous memory rather than a scatterlist - scatterlists with struct page pointers
Does solving that need us to compare all the constraints of each and every importer, possibly ending up with constraints which can't be satisfied? No. Does the flag approach satisfy the requirements? Yes.
Frankly, if we want to make it really portable and sharable between devices, then IMO we should get rid of struct scatterlist and replace it with simple array of pfns in dma_buf. This way at least the problem of missing struct page will be solved and the buffer representation will be also a bit more compact.
... and move the mapping and unmapping of the PFNs to the importer, which IMHO is where it should already be (so the importer can decide when it should map the buffer itself independently of getting the details of the buffer.)
Such solution however also requires extending dma-mapping API to handle mapping and unmapping of such pfn arrays. The advantage of this approach is the fact that it will be completely new API, so it can be designed well from the beginning.
As far as I'm aware, there was no big discussion of the dma_buf API - it's something that just appeared one day (I don't remember seeing it discussed.) So, that may well be a good thing if it means we can get these kinds of details better hammered out.
However, I don't share your view of "completely new API" - that would be far too disruptive. I think we can modify the existing API, to achieve our goals.
I don't think changing the dma-mapping API just for this case is really on though - if we're passed a list of PFNs, they either must be all associated with a struct page - iow, pfn_valid(pfn) returns true for all of them or none of them. If none of them, then we need to be able to convert those PFNs to a dma_addr_t for the target device (yes, that may need the dma-mapping API augmenting.)
However, if they are associated with a struct page, then we should use the established APIs and use a scatterlist, otherwise we're looking at rewriting all IOMMUs and all DMA mapping implementations to handle what would become a special case for dma_buf.
I'd rather... Keep It Simple.
So, maybe, as a first step, let's augment dma_buf with a pair of functions which get the "raw" unmapped scatterlist:
struct sg_table *dma_buf_get_scatterlist(struct dma_buf_attachment *attach) { struct sg_table *sg_table;
might_sleep();
if (!attach->dmabuf->ops->get_scatterlist) return ERR_PTR(-EINVAL);
sg_table = attach->dmabuf->ops->get_scatterlist(attach); if (!sg_table) sg_table = ERR_PTR(-ENOMEM);
return sg_table; }
void dma_buf_put_scatterlist(struct dma_buf_attachment *attach, struct sg_table *sg_table) { might_sleep();
attach->dmabuf->ops->put_scatterlist(attach, sg_table); }
Implementations should arrange for dma_buf_get_scatterlist() to return the EINVAL error pointer if they are unable to provide an unmapped scatterlist (in other words, they are exporting a set of PFNs or already-mapped dma_addr_t's.) This can be done by either not implementing the get_scatterlist method, or by implementing it and returning that forementioned error pointer value.
Where these two are implemented and used, the importer is responsible for calling dma_map_sg() and dma_unmap_sg() on the returned scatterlist table.
unsigned long *dma_buf_get_pfns(struct dma_buf_attachment *attach) { unsigned long *pfns;
might_sleep();
if (!attach->dmabuf->ops->get_pfns) return ERR_PTR(-EINVAL);
return attach->dmabuf->ops->get_pfns(attach); }
void dma_buf_put_pfns(struct dma_buf_attachment *attach, unsigned long *pfns) { might_sleep();
attach->dmabuf->ops->put_pfns(attach, pfns); }
Similar to the above, but this gets a list of PFNs. Each PFN entry prior to the last describes a page starting at offset 0 extending to the end of the page. The last PFN entry describes a page starting at offset 0 and extending to the offset of the attachment size within the page. Again, if not implemented or it is not possible to represent the buffer as PFNs, it returns -EINVAL.
For the time being, we keep the existing dma_buf_map_attachment() and dma_buf_unmap_attachment() while we transition users over to the new interfaces.
We may wish to augment struct dma_buf_attachment with a couple of flags to indicate which of these the attached dma_buf supports, so that drivers can deal with these themselves.
We may also wish in the longer term to keep dma_buf_map_attachment() but implement it as a wrapper around get_scatterlist() as a helper to provide its existing functionality - providing a mapped scatterlist. Possibly also including get_pfns() in that as well if we need to.
However, I would still like more thought put into Rob's issue to see whether we can solve his problem with using the dma_addr_t in a more elegant way. (I wish I had some hardware to experiment with for that.)
Hi Russell, everyone,
First up, sincere apologies for being awol for sometime; had some personal / medical things to take care of, and then I thought I'd wait for the merge window to get over before beginning to discuss this again.
On 11 February 2015 at 21:53, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Wed, Feb 11, 2015 at 01:20:24PM +0100, Marek Szyprowski wrote:
Hello,
On 2015-02-11 12:12, Russell King - ARM Linux wrote:
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
Okay, but flag-based approach also have limitations.
Yes, the flag-based approach doesn't let you describe in detail what the importer can accept - which, given the issues that I've raised is a *good* thing. We won't be misleading anyone into thinking that we can do something that's really half-baked, and which we have no present requirement for.
This is precisely what Linus talks about when he says "don't over- engineer" - if we over-engineer this, we end up with something that sort-of works, and that's a bad thing.
The Keep It Simple approach here makes total sense - what are our current requirements - to be able to say that an importer can only accept:
- contiguous memory rather than a scatterlist
- scatterlists with struct page pointers
Does solving that need us to compare all the constraints of each and every importer, possibly ending up with constraints which can't be satisfied? No. Does the flag approach satisfy the requirements? Yes.
So, for basic constraint-sharing, we'll just go with the flag based approach, with a flag (best place for it is still dev->dma_params I suppose) for denoting contiguous or scatterlist. Is that agreed, then? Also, with this idea, of course, there won't be any helpers for trying to calculate constraints; it would be totally the exporter's responsibility to handle it via the attach() dma_buf_op if it wishes to.
Frankly, if we want to make it really portable and sharable between devices, then IMO we should get rid of struct scatterlist and replace it with simple array of pfns in dma_buf. This way at least the problem of missing struct page will be solved and the buffer representation will be also a bit more compact.
... and move the mapping and unmapping of the PFNs to the importer, which IMHO is where it should already be (so the importer can decide when it should map the buffer itself independently of getting the details of the buffer.)
Such solution however also requires extending dma-mapping API to handle mapping and unmapping of such pfn arrays. The advantage of this approach is the fact that it will be completely new API, so it can be designed well from the beginning.
As far as I'm aware, there was no big discussion of the dma_buf API - it's something that just appeared one day (I don't remember seeing it discussed.) So, that may well be a good thing if it means we can get these kinds of details better hammered out.
However, I don't share your view of "completely new API" - that would be far too disruptive. I think we can modify the existing API, to achieve our goals.
I don't think changing the dma-mapping API just for this case is really on though - if we're passed a list of PFNs, they either must be all associated with a struct page - iow, pfn_valid(pfn) returns true for all of them or none of them. If none of them, then we need to be able to convert those PFNs to a dma_addr_t for the target device (yes, that may need the dma-mapping API augmenting.)
However, if they are associated with a struct page, then we should use the established APIs and use a scatterlist, otherwise we're looking at rewriting all IOMMUs and all DMA mapping implementations to handle what would become a special case for dma_buf.
I'd rather... Keep It Simple.
+1 for Keep it simple, and the idea overall. Though I suspect more dma-buf users (dri / v4l friends?) should comment if this doesn't help solve things on some platforms / subsystems that they care about.
So, maybe, as a first step, let's augment dma_buf with a pair of functions which get the "raw" unmapped scatterlist:
struct sg_table *dma_buf_get_scatterlist(struct dma_buf_attachment *attach) { struct sg_table *sg_table;
might_sleep(); if (!attach->dmabuf->ops->get_scatterlist) return ERR_PTR(-EINVAL); sg_table = attach->dmabuf->ops->get_scatterlist(attach); if (!sg_table) sg_table = ERR_PTR(-ENOMEM); return sg_table;
}
void dma_buf_put_scatterlist(struct dma_buf_attachment *attach, struct sg_table *sg_table) { might_sleep();
attach->dmabuf->ops->put_scatterlist(attach, sg_table);
}
Implementations should arrange for dma_buf_get_scatterlist() to return the EINVAL error pointer if they are unable to provide an unmapped scatterlist (in other words, they are exporting a set of PFNs or already-mapped dma_addr_t's.) This can be done by either not implementing the get_scatterlist method, or by implementing it and returning that forementioned error pointer value.
Where these two are implemented and used, the importer is responsible for calling dma_map_sg() and dma_unmap_sg() on the returned scatterlist table.
unsigned long *dma_buf_get_pfns(struct dma_buf_attachment *attach) { unsigned long *pfns;
might_sleep(); if (!attach->dmabuf->ops->get_pfns) return ERR_PTR(-EINVAL); return attach->dmabuf->ops->get_pfns(attach);
}
void dma_buf_put_pfns(struct dma_buf_attachment *attach, unsigned long *pfns) { might_sleep();
attach->dmabuf->ops->put_pfns(attach, pfns);
}
Similar to the above, but this gets a list of PFNs. Each PFN entry prior to the last describes a page starting at offset 0 extending to the end of the page. The last PFN entry describes a page starting at offset 0 and extending to the offset of the attachment size within the page. Again, if not implemented or it is not possible to represent the buffer as PFNs, it returns -EINVAL.
For the time being, we keep the existing dma_buf_map_attachment() and dma_buf_unmap_attachment() while we transition users over to the new interfaces.
We may wish to augment struct dma_buf_attachment with a couple of flags to indicate which of these the attached dma_buf supports, so that drivers can deal with these themselves.
Could I please send a patchset first for the constraint-flags addition, and then take this up in a following patch-set, once we have agreement from other stake holders as well?
Russell: would it be ok if we discuss it offline more? I am afraid I am not as knowledgeable on this topic, and would probably request your help / guidance here.
We may also wish in the longer term to keep dma_buf_map_attachment() but implement it as a wrapper around get_scatterlist() as a helper to provide its existing functionality - providing a mapped scatterlist. Possibly also including get_pfns() in that as well if we need to.
However, I would still like more thought put into Rob's issue to see whether we can solve his problem with using the dma_addr_t in a more elegant way. (I wish I had some hardware to experiment with for that.)
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
Best regards, ~Sumit.
Hi Sumit,
On 05/05/2015 04:41 PM, Sumit Semwal wrote:
Hi Russell, everyone,
First up, sincere apologies for being awol for sometime; had some personal / medical things to take care of, and then I thought I'd wait for the merge window to get over before beginning to discuss this again.
On 11 February 2015 at 21:53, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Wed, Feb 11, 2015 at 01:20:24PM +0100, Marek Szyprowski wrote:
Hello,
On 2015-02-11 12:12, Russell King - ARM Linux wrote:
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
Okay, but flag-based approach also have limitations.
Yes, the flag-based approach doesn't let you describe in detail what the importer can accept - which, given the issues that I've raised is a *good* thing. We won't be misleading anyone into thinking that we can do something that's really half-baked, and which we have no present requirement for.
This is precisely what Linus talks about when he says "don't over- engineer" - if we over-engineer this, we end up with something that sort-of works, and that's a bad thing.
The Keep It Simple approach here makes total sense - what are our current requirements - to be able to say that an importer can only accept:
- contiguous memory rather than a scatterlist
- scatterlists with struct page pointers
Does solving that need us to compare all the constraints of each and every importer, possibly ending up with constraints which can't be satisfied? No. Does the flag approach satisfy the requirements? Yes.
So, for basic constraint-sharing, we'll just go with the flag based approach, with a flag (best place for it is still dev->dma_params I suppose) for denoting contiguous or scatterlist. Is that agreed, then? Also, with this idea, of course, there won't be any helpers for trying to calculate constraints; it would be totally the exporter's responsibility to handle it via the attach() dma_buf_op if it wishes to.
What's wrong with the proposed max_segment_count? Many media devices do have a limited max_segment_count and that should be taken into account.
Although to be fair I have to say that in practice the limited segment counts are all well above what is needed for a normal capture buffer, but it might be reached if you have video strides > line width, so each line has its own segment (or two, if there is a page boundary in the middle of the line). It's a somewhat contrived use-case, though, although I have done this once myself.
Anyway, the worst-case number of segments would be:
height * ((bytesperline + PAGE_SIZE - 1) / PAGE_SIZE)
That might well exceed the max segment count for some devices.
I think I have also seen devices in the past that have a coarse-grained IOMMU that uses very large segment sizes, but have a very limited segment count. We don't have media drivers like that, and to be honest I am not sure whether we should bother too much with this corner case.
Frankly, if we want to make it really portable and sharable between devices, then IMO we should get rid of struct scatterlist and replace it with simple array of pfns in dma_buf. This way at least the problem of missing struct page will be solved and the buffer representation will be also a bit more compact.
... and move the mapping and unmapping of the PFNs to the importer, which IMHO is where it should already be (so the importer can decide when it should map the buffer itself independently of getting the details of the buffer.)
Such solution however also requires extending dma-mapping API to handle mapping and unmapping of such pfn arrays. The advantage of this approach is the fact that it will be completely new API, so it can be designed well from the beginning.
As far as I'm aware, there was no big discussion of the dma_buf API - it's something that just appeared one day (I don't remember seeing it discussed.) So, that may well be a good thing if it means we can get these kinds of details better hammered out.
However, I don't share your view of "completely new API" - that would be far too disruptive. I think we can modify the existing API, to achieve our goals.
I don't think changing the dma-mapping API just for this case is really on though - if we're passed a list of PFNs, they either must be all associated with a struct page - iow, pfn_valid(pfn) returns true for all of them or none of them. If none of them, then we need to be able to convert those PFNs to a dma_addr_t for the target device (yes, that may need the dma-mapping API augmenting.)
However, if they are associated with a struct page, then we should use the established APIs and use a scatterlist, otherwise we're looking at rewriting all IOMMUs and all DMA mapping implementations to handle what would become a special case for dma_buf.
I'd rather... Keep It Simple.
+1 for Keep it simple, and the idea overall. Though I suspect more dma-buf users (dri / v4l friends?) should comment if this doesn't help solve things on some platforms / subsystems that they care about.
So, maybe, as a first step, let's augment dma_buf with a pair of functions which get the "raw" unmapped scatterlist:
struct sg_table *dma_buf_get_scatterlist(struct dma_buf_attachment *attach) { struct sg_table *sg_table;
might_sleep(); if (!attach->dmabuf->ops->get_scatterlist) return ERR_PTR(-EINVAL); sg_table = attach->dmabuf->ops->get_scatterlist(attach); if (!sg_table) sg_table = ERR_PTR(-ENOMEM); return sg_table;
}
void dma_buf_put_scatterlist(struct dma_buf_attachment *attach, struct sg_table *sg_table) { might_sleep();
attach->dmabuf->ops->put_scatterlist(attach, sg_table);
}
Implementations should arrange for dma_buf_get_scatterlist() to return the EINVAL error pointer if they are unable to provide an unmapped scatterlist (in other words, they are exporting a set of PFNs or already-mapped dma_addr_t's.) This can be done by either not implementing the get_scatterlist method, or by implementing it and returning that forementioned error pointer value.
Where these two are implemented and used, the importer is responsible for calling dma_map_sg() and dma_unmap_sg() on the returned scatterlist table.
unsigned long *dma_buf_get_pfns(struct dma_buf_attachment *attach) { unsigned long *pfns;
might_sleep(); if (!attach->dmabuf->ops->get_pfns) return ERR_PTR(-EINVAL); return attach->dmabuf->ops->get_pfns(attach);
}
void dma_buf_put_pfns(struct dma_buf_attachment *attach, unsigned long *pfns) { might_sleep();
attach->dmabuf->ops->put_pfns(attach, pfns);
}
Similar to the above, but this gets a list of PFNs. Each PFN entry prior to the last describes a page starting at offset 0 extending to the end of the page. The last PFN entry describes a page starting at offset 0 and extending to the offset of the attachment size within the page. Again, if not implemented or it is not possible to represent the buffer as PFNs, it returns -EINVAL.
For the time being, we keep the existing dma_buf_map_attachment() and dma_buf_unmap_attachment() while we transition users over to the new interfaces.
We may wish to augment struct dma_buf_attachment with a couple of flags to indicate which of these the attached dma_buf supports, so that drivers can deal with these themselves.
Could I please send a patchset first for the constraint-flags addition, and then take this up in a following patch-set, once we have agreement from other stake holders as well?
+1
One of the main problems end-users are faced with today is that they do not know which device should be the exporter of buffers and which should be the importer. This depends on the constraints and right now applications have no way of knowing this. It's nuts that this hasn't been addressed yet since it is the main complaint I am getting.
And it is something that the V4L2 API needs to export to userland anyway for both the above and some other V4L2-specific reasons.
BTW, one thing I didn't see in the constraints discussion (but I didn't read everything, so I may very well have missed it), is reporting whether memory is opaque (i.e. cannot be mapped by non-trusted devices). No, I don't have such hardware, but I know it exists. All the constraints mentioned so far apply to such memory as well, but this is an additional flag (or perhaps some trust ID). I don't think anyone made drivers for such hardware, although I recently saw someone working on something like that.
Does anyone know about someone working in this area? Or existing code for this?
All I am saying is that this is sure to appear at some point.
Regards,
Hans
Russell: would it be ok if we discuss it offline more? I am afraid I am not as knowledgeable on this topic, and would probably request your help / guidance here.
We may also wish in the longer term to keep dma_buf_map_attachment() but implement it as a wrapper around get_scatterlist() as a helper to provide its existing functionality - providing a mapped scatterlist. Possibly also including get_pfns() in that as well if we need to.
However, I would still like more thought put into Rob's issue to see whether we can solve his problem with using the dma_addr_t in a more elegant way. (I wish I had some hardware to experiment with for that.)
-- FTTC broadband for 0.8mile line: currently at 10.5Mbps down 400kbps up according to speedtest.net.
Best regards, ~Sumit. _______________________________________________ Linaro-mm-sig mailing list Linaro-mm-sig@lists.linaro.org https://lists.linaro.org/mailman/listinfo/linaro-mm-sig
On Wed, Jun 03, 2015 at 08:39:55AM +0200, Hans Verkuil wrote:
Hi Sumit,
On 05/05/2015 04:41 PM, Sumit Semwal wrote:
Hi Russell, everyone,
First up, sincere apologies for being awol for sometime; had some personal / medical things to take care of, and then I thought I'd wait for the merge window to get over before beginning to discuss this again.
On 11 February 2015 at 21:53, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Wed, Feb 11, 2015 at 01:20:24PM +0100, Marek Szyprowski wrote:
Hello,
On 2015-02-11 12:12, Russell King - ARM Linux wrote:
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
Okay, but flag-based approach also have limitations.
Yes, the flag-based approach doesn't let you describe in detail what the importer can accept - which, given the issues that I've raised is a *good* thing. We won't be misleading anyone into thinking that we can do something that's really half-baked, and which we have no present requirement for.
This is precisely what Linus talks about when he says "don't over- engineer" - if we over-engineer this, we end up with something that sort-of works, and that's a bad thing.
The Keep It Simple approach here makes total sense - what are our current requirements - to be able to say that an importer can only accept:
- contiguous memory rather than a scatterlist
- scatterlists with struct page pointers
Does solving that need us to compare all the constraints of each and every importer, possibly ending up with constraints which can't be satisfied? No. Does the flag approach satisfy the requirements? Yes.
So, for basic constraint-sharing, we'll just go with the flag based approach, with a flag (best place for it is still dev->dma_params I suppose) for denoting contiguous or scatterlist. Is that agreed, then? Also, with this idea, of course, there won't be any helpers for trying to calculate constraints; it would be totally the exporter's responsibility to handle it via the attach() dma_buf_op if it wishes to.
What's wrong with the proposed max_segment_count? Many media devices do have a limited max_segment_count and that should be taken into account.
So what happens if you have a dma_buf exporter, and several dma_buf importers. One dma_buf importer attaches to the exporter, and asks for the buffer, and starts making use of the buffer. This export has many scatterlist segments.
Another dma_buf importer attaches to the same buffer, and now asks for the buffer, but the number of scatterlist segments exceeds it's requirement.
You can't reallocate the buffer because it's in-use by another importer. There is no way to revoke the buffer from the other importer. So there is no way to satisfy this importer's requirements.
What I'm showing is that the idea that exporting these parameters fixes some problem is just an illusion - it may work for the single importer case, but doesn't for the multiple importer case.
Importers really have two choices here: either they accept what the exporter is giving them, or they reject it.
The other issue here is that DMA scatterlists are _not_ really that determinable in terms of number of entries when it comes to systems with system IOMMUs. System IOMMUs, which should be integrated into the DMA API, are permitted to coalesce entries in the physical page range. For example:
nsg = 128; n = dma_map_sg(dev, sg, nsg, DMA_TO_DEVICE);
Here, n might be 4 if the system IOMMU has been able to coalesce the 128 entries down to 4 IOMMU entries - and that means for DMA purposes, only the first four scatterlist entries should be walked (this is why dma_map_sg() returns a positive integer when mapping.)
Each struct device has a set of parameters which control how the IOMMU entries are coalesced:
struct device_dma_parameters { /* * a low level driver may set these to teach IOMMU code about * sg limitations. */ unsigned int max_segment_size; unsigned long segment_boundary_mask; };
and this is independent of the dma_buf API. This doesn't indicate the maximum number of segments, but as I've shown above, it's not something that you can say "I want a scatterlist for this memory with only 32 segments" so it's totally unclear how an exporter would limit that.
The only thing an exporter could do would be to fail the export if the buffer didn't end up having fewer than the requested scatterlist entries, which is something the importer can do too.
One of the main problems end-users are faced with today is that they do not know which device should be the exporter of buffers and which should be the importer. This depends on the constraints and right now applications have no way of knowing this. It's nuts that this hasn't been addressed yet since it is the main complaint I am getting.
IT's nuts that we've ended up in this situation in the first place. This was bound to happen as soon as the dma_buf sharing was introduced, because it immediately introduced this problem. I don't think there is any easy solution to it, and what's being proposed with flags and other stuff is just trying to paper over the problem.
What you're actually asking is that each dmabuf exporting subsystem needs to publish their DMA parameters to userspace, and userspace then gets to decide which dmabuf exporter should be used.
That's not a dmabuf problem, that's a subsystem problem, but even so, we don't have a standardised way to export that information (and I'd suspect that it would be very difficult to get agreements between subsystems on a standard ioctl and/or data structure.) In my experience, getting cross- subsystem agreement in the kernel with anything is very difficult, you normally end up with 60% of people agreeing, and the other 40% going off and doing something completely different because they object to it (figures vary, 90% of all statistics are made up on the spot!)
On 06/03/15 10:41, Russell King - ARM Linux wrote:
On Wed, Jun 03, 2015 at 08:39:55AM +0200, Hans Verkuil wrote:
Hi Sumit,
On 05/05/2015 04:41 PM, Sumit Semwal wrote:
Hi Russell, everyone,
First up, sincere apologies for being awol for sometime; had some personal / medical things to take care of, and then I thought I'd wait for the merge window to get over before beginning to discuss this again.
On 11 February 2015 at 21:53, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Wed, Feb 11, 2015 at 01:20:24PM +0100, Marek Szyprowski wrote:
Hello,
On 2015-02-11 12:12, Russell King - ARM Linux wrote:
Which is a damn good reason to NAK it - by that admission, it's a half-baked idea.
If all we want to know is whether the importer can accept only contiguous memory or not, make a flag to do that, and allow the exporter to test this flag. Don't over-engineer this to make it _seem_ like it can do something that it actually totally fails with.
As I've already pointed out, there's a major problem if you have already had a less restrictive attachment which has an active mapping, and a new more restrictive attachment comes along later.
It seems from Rob's descriptions that we also need another flag in the importer to indicate whether it wants to have a valid struct page in the scatter list, or whether it (correctly) uses the DMA accessors on the scatter list - so that exporters can reject importers which are buggy.
Okay, but flag-based approach also have limitations.
Yes, the flag-based approach doesn't let you describe in detail what the importer can accept - which, given the issues that I've raised is a *good* thing. We won't be misleading anyone into thinking that we can do something that's really half-baked, and which we have no present requirement for.
This is precisely what Linus talks about when he says "don't over- engineer" - if we over-engineer this, we end up with something that sort-of works, and that's a bad thing.
The Keep It Simple approach here makes total sense - what are our current requirements - to be able to say that an importer can only accept:
- contiguous memory rather than a scatterlist
- scatterlists with struct page pointers
Does solving that need us to compare all the constraints of each and every importer, possibly ending up with constraints which can't be satisfied? No. Does the flag approach satisfy the requirements? Yes.
So, for basic constraint-sharing, we'll just go with the flag based approach, with a flag (best place for it is still dev->dma_params I suppose) for denoting contiguous or scatterlist. Is that agreed, then? Also, with this idea, of course, there won't be any helpers for trying to calculate constraints; it would be totally the exporter's responsibility to handle it via the attach() dma_buf_op if it wishes to.
What's wrong with the proposed max_segment_count? Many media devices do have a limited max_segment_count and that should be taken into account.
So what happens if you have a dma_buf exporter, and several dma_buf importers. One dma_buf importer attaches to the exporter, and asks for the buffer, and starts making use of the buffer. This export has many scatterlist segments.
Another dma_buf importer attaches to the same buffer, and now asks for the buffer, but the number of scatterlist segments exceeds it's requirement.
You can't reallocate the buffer because it's in-use by another importer. There is no way to revoke the buffer from the other importer. So there is no way to satisfy this importer's requirements.
What I'm showing is that the idea that exporting these parameters fixes some problem is just an illusion - it may work for the single importer case, but doesn't for the multiple importer case.
Importers really have two choices here: either they accept what the exporter is giving them, or they reject it.
I agree completely with that.
The other issue here is that DMA scatterlists are _not_ really that determinable in terms of number of entries when it comes to systems with system IOMMUs. System IOMMUs, which should be integrated into the DMA API, are permitted to coalesce entries in the physical page range. For example:
nsg = 128; n = dma_map_sg(dev, sg, nsg, DMA_TO_DEVICE);
Here, n might be 4 if the system IOMMU has been able to coalesce the 128 entries down to 4 IOMMU entries - and that means for DMA purposes, only the first four scatterlist entries should be walked (this is why dma_map_sg() returns a positive integer when mapping.)
Each struct device has a set of parameters which control how the IOMMU entries are coalesced:
struct device_dma_parameters { /* * a low level driver may set these to teach IOMMU code about * sg limitations. */ unsigned int max_segment_size; unsigned long segment_boundary_mask; };
and this is independent of the dma_buf API. This doesn't indicate the maximum number of segments, but as I've shown above, it's not something that you can say "I want a scatterlist for this memory with only 32 segments" so it's totally unclear how an exporter would limit that.
The only thing an exporter could do would be to fail the export if the buffer didn't end up having fewer than the requested scatterlist entries, which is something the importer can do too.
Right.
One of the main problems end-users are faced with today is that they do not know which device should be the exporter of buffers and which should be the importer. This depends on the constraints and right now applications have no way of knowing this. It's nuts that this hasn't been addressed yet since it is the main complaint I am getting.
IT's nuts that we've ended up in this situation in the first place. This was bound to happen as soon as the dma_buf sharing was introduced, because it immediately introduced this problem. I don't think there is any easy solution to it, and what's being proposed with flags and other stuff is just trying to paper over the problem.
This was the first thing raised in the initial discussions. My suggestion at the time was to give userspace limited information about the buffer restrictions: Physically contiguous, scatter-gather and 'weird'. But obviously you need segment_boundary_mask and max_segment_size as well.
And the application can decide based on that info which device has the most restrictive requirements and make that the exporter.
I am not sure whether there is any sense in exporting the max_segment_count to userspace (probably not), but I see no reason why we can't set it internally. That said, a single flag is OK for me as well.
What you're actually asking is that each dmabuf exporting subsystem needs to publish their DMA parameters to userspace, and userspace then gets to decide which dmabuf exporter should be used.
Yes, that was the initial plan.
That's not a dmabuf problem, that's a subsystem problem,
Well, yes, but it doesn't hurt to sync which DMA parameters are exposed with what dma-buf uses.
but even so, we don't have a standardised way to export that information (and I'd suspect that it would be very difficult to get agreements between subsystems on a standard ioctl and/or data structure.) In my experience, getting cross- subsystem agreement in the kernel with anything is very difficult, you normally end up with 60% of people agreeing, and the other 40% going off and doing something completely different because they object to it (figures vary, 90% of all statistics are made up on the spot!)
I don't care which ioctl or other mechanism a subsystem uses to expose the information. Each subsystem should design their own method for that. Imposing a standard API for that generally doesn't work for the reasons you give.
But deciding *which* minimum set of information is exposed, that is another matter and that should be synced. And the easiest starting point for that is that the device will store that information internally in device_dma_parameters.
The various subsystems can then make APIs to expose that info.
Regards,
Hans
On 3 June 2015 at 15:07, Hans Verkuil hverkuil@xs4all.nl wrote:
On 06/03/15 10:41, Russell King - ARM Linux wrote:
On Wed, Jun 03, 2015 at 08:39:55AM +0200, Hans Verkuil wrote:
Hi Sumit,
On 05/05/2015 04:41 PM, Sumit Semwal wrote:
Hi Russell, everyone,
First up, sincere apologies for being awol for sometime; had some personal / medical things to take care of, and then I thought I'd wait for the merge window to get over before beginning to discuss this again.
On 11 February 2015 at 21:53, Russell King - ARM Linux linux@arm.linux.org.uk wrote:
On Wed, Feb 11, 2015 at 01:20:24PM +0100, Marek Szyprowski wrote:
Hello,
On 2015-02-11 12:12, Russell King - ARM Linux wrote: > Which is a damn good reason to NAK it - by that admission, it's a half-baked > idea. > > If all we want to know is whether the importer can accept only contiguous > memory or not, make a flag to do that, and allow the exporter to test this > flag. Don't over-engineer this to make it _seem_ like it can do something > that it actually totally fails with. > > As I've already pointed out, there's a major problem if you have already > had a less restrictive attachment which has an active mapping, and a new > more restrictive attachment comes along later. > > It seems from Rob's descriptions that we also need another flag in the > importer to indicate whether it wants to have a valid struct page in the > scatter list, or whether it (correctly) uses the DMA accessors on the > scatter list - so that exporters can reject importers which are buggy.
Okay, but flag-based approach also have limitations.
Yes, the flag-based approach doesn't let you describe in detail what the importer can accept - which, given the issues that I've raised is a *good* thing. We won't be misleading anyone into thinking that we can do something that's really half-baked, and which we have no present requirement for.
This is precisely what Linus talks about when he says "don't over- engineer" - if we over-engineer this, we end up with something that sort-of works, and that's a bad thing.
The Keep It Simple approach here makes total sense - what are our current requirements - to be able to say that an importer can only accept:
- contiguous memory rather than a scatterlist
- scatterlists with struct page pointers
Does solving that need us to compare all the constraints of each and every importer, possibly ending up with constraints which can't be satisfied? No. Does the flag approach satisfy the requirements? Yes.
So, for basic constraint-sharing, we'll just go with the flag based approach, with a flag (best place for it is still dev->dma_params I suppose) for denoting contiguous or scatterlist. Is that agreed, then? Also, with this idea, of course, there won't be any helpers for trying to calculate constraints; it would be totally the exporter's responsibility to handle it via the attach() dma_buf_op if it wishes to.
What's wrong with the proposed max_segment_count? Many media devices do have a limited max_segment_count and that should be taken into account.
So what happens if you have a dma_buf exporter, and several dma_buf importers. One dma_buf importer attaches to the exporter, and asks for the buffer, and starts making use of the buffer. This export has many scatterlist segments.
Another dma_buf importer attaches to the same buffer, and now asks for the buffer, but the number of scatterlist segments exceeds it requirement.
So, in the midst of all the various directions this discussion has taken, I seem to have missed to reiterate the base premise for this suggestion [1] - that we can use this information to help implement a deferred allocation logic - so that all the importers can attach first, and the exporter can do the actual allocation on the first map() call. This is also inline with the prescribed usage of dma_buf_attach() / dma_buf_map_attachment() sequence - ideally speaking, all participating 'importers' of dma_buf should only attach first, and then map() at a 'later' time, which is usually right before using the buffer actually. Note: at present, both DRI and V4L subsystems don't do that; while proposing this RFC I had deliberately kept that separate, as it is a related but orthogonal problem to solve. I guess I should address that in parallel.
You can't reallocate the buffer because it's in-use by another importer. There is no way to revoke the buffer from the other importer. So there is no way to satisfy this importer's requirements.
You're right; but in a deferred allocation mechanism, this constraint-sharing can atleast help decide on the most restrictive allocation at the time of first map() ,and if later, an importer with more relaxed constraints attaches, it can still use the same buffer. A more restrictive importer would still be not allowed, but in that case the exporter can disallow that importer from attaching, and a feedback to userspace is possible.
What I'm showing is that the idea that exporting these parameters fixes some problem is just an illusion - it may work for the single importer case, but doesn't for the multiple importer case.
Importers really have two choices here: either they accept what the exporter is giving them, or they reject it.
I agree completely with that.
In a non-deferred allocation case, these constraints have no meaning, since it will always depend on the first subsystem to attach, irrespective of the exporter's possible capability to allocate from different allocators with different constraints; for example, if a subsystem with relaxed constraints attaches first, a later more restrictive constraints attach() request will fail, even though the exporter might have the ability to allocate with the more restricted constraints.
In deferred allocation, on the other hand, the exporter atleast gets the ability to possibly choose the allocation mechanism based on the match of most restricted importers, so the order of attach() ceases to matter.
The other issue here is that DMA scatterlists are _not_ really that determinable in terms of number of entries when it comes to systems with system IOMMUs. System IOMMUs, which should be integrated into the DMA API, are permitted to coalesce entries in the physical page range. For example:
nsg = 128; n = dma_map_sg(dev, sg, nsg, DMA_TO_DEVICE);
Here, n might be 4 if the system IOMMU has been able to coalesce the 128 entries down to 4 IOMMU entries - and that means for DMA purposes, only the first four scatterlist entries should be walked (this is why dma_map_sg() returns a positive integer when mapping.)
Each struct device has a set of parameters which control how the IOMMU entries are coalesced:
struct device_dma_parameters { /* * a low level driver may set these to teach IOMMU code about * sg limitations. */ unsigned int max_segment_size; unsigned long segment_boundary_mask; };
and this is independent of the dma_buf API. This doesn't indicate the maximum number of segments, but as I've shown above, it's not something that you can say "I want a scatterlist for this memory with only 32 segments" so it's totally unclear how an exporter would limit that.
The only thing an exporter could do would be to fail the buffer didn't end up having fewer than the requested scatterlist entries, which is something the importer can do too.
You're right in that after allocation is done, an exporter can only fail a more restrictive attach request, but I don't think the importer has any way of knowing the information about the current allocation? Unless I misunderstood something.
Right.
One of the main problems end-users are faced with today is that they do not know which device should be the exporter of buffers and which should be the importer. This depends on the constraints and right now applications have no way of knowing this. It's nuts that this hasn't been addressed yet since it is the main complaint I am getting.
One of the ways to try and solve this is via the deferred allocation mechanism described above; I hope it makes sense to you all, but if it doesn't, may I request you to please help me understand why not?
IT's nuts that we've ended up in this situation in the first place. This was bound to happen as soon as the dma_buf sharing was introduced, because it immediately introduced this problem. I don't think there is any easy solution to it, and what's being proposed with flags and other stuff is just trying to paper over the problem.
This was the first thing raised in the initial discussions. My suggestion at the time was to give userspace limited information about the buffer restrictions: Physically contiguous, scatter-gather and 'weird'. But obviously you need segment_boundary_mask and max_segment_size as well.
And the application can decide based on that info which device has the most restrictive requirements and make that the exporter.
I am not sure whether there is any sense in exporting the max_segment_count to userspace (probably not), but I see no reason why we can't set it internally. That said, a single flag is OK for me as well.
What you're actually asking is that each dmabuf exporting subsystem needs to publish their DMA parameters to userspace, and userspace then gets to decide which dmabuf exporter should be used.
Yes, that was the initial plan.
In absence of deferred allocation, that could be the other option. With deferred allocation, we could try and keep this internal to the kernel.
That's not a dmabuf problem, that's a subsystem problem,
Well, yes, but it doesn't hurt to sync which DMA parameters are exposed with what dma-buf uses.
but even so, we don't have a standardised way to export that information (and I'd suspect that it would be very difficult to get agreements between subsystems on a standard ioctl and/or data structure.) In my experience, getting cross- subsystem agreement in the kernel with anything is very difficult, you normally end up with 60% of people agreeing, and the other 40% going off and doing something completely different because they object to it (figures vary, 90% of all statistics are made up on the spot!)
I don't care which ioctl or other mechanism a subsystem uses to expose the information. Each subsystem should design their own method for that. Imposing a standard API for that generally doesn't work for the reasons you give.
But deciding *which* minimum set of information is exposed, that is another matter and that should be synced. And the easiest starting point for that is that the device will store that information internally in device_dma_parameters.
The various subsystems can then make APIs to expose that info.
Regards,
Hans
Best regards, Sumit.
Hello,
On 2015-01-27 09:25, Sumit Semwal wrote:
From: Rob Clark robdclark@gmail.com
For devices which have constraints about maximum number of segments in an sglist. For example, a device which could only deal with contiguous buffers would set max_segment_count to 1.
The initial motivation is for devices sharing buffers via dma-buf, to allow the buffer exporter to know the constraints of other devices which have attached to the buffer. The dma_mask and fields in 'struct device_dma_parameters' tell the exporter everything else that is needed, except whether the importer has constraints about maximum number of segments.
Signed-off-by: Rob Clark robdclark@gmail.com [sumits: Minor updates wrt comments] Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
This feature is definitely needed to start thinking of real buffer sharing between devices.
Acked-by: Marek Szyprowski m.szyprowski@samsung.com
v3: include Robin Murphy's fix[1] for handling '0' as a value for max_segment_count v2: minor updates wrt comments on the first version
include/linux/device.h | 1 + include/linux/dma-mapping.h | 19 +++++++++++++++++++ 2 files changed, 20 insertions(+)
diff --git a/include/linux/device.h b/include/linux/device.h index fb506738f7b7..a32f9b67315c 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -647,6 +647,7 @@ struct device_dma_parameters { * sg limitations. */ unsigned int max_segment_size;
- unsigned int max_segment_count; /* INT_MAX for unlimited */ unsigned long segment_boundary_mask; };
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index c3007cb4bfa6..d3351a36d5ec 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -154,6 +154,25 @@ static inline unsigned int dma_set_max_seg_size(struct device *dev, return -EIO; } +#define DMA_SEGMENTS_MAX_SEG_COUNT ((unsigned int) INT_MAX)
+static inline unsigned int dma_get_max_seg_count(struct device *dev) +{
- if (dev->dma_parms && dev->dma_parms->max_segment_count)
return dev->dma_parms->max_segment_count;
- return DMA_SEGMENTS_MAX_SEG_COUNT;
+}
+static inline int dma_set_max_seg_count(struct device *dev,
unsigned int count)
+{
- if (dev->dma_parms) {
dev->dma_parms->max_segment_count = count;
return 0;
- }
- return -EIO;
+}
- static inline unsigned long dma_get_seg_boundary(struct device *dev) { return dev->dma_parms ?
Best regards
Hi Sumit, Rob,
Is there any reason why this patch hasn't been merged yet? It makes perfect sense to me and I would really like to use this in the media drivers.
Many DMA engines do have a limit to the number of segments (obviously a max count of 1 being the most common limitation, but other limits are definitely possible), so this patch seems a no-brainer to me.
So:
Acked-by: Hans Verkuil hans.verkuil@cisco.com
Regards,
Hans
On 01/27/2015 09:25 AM, Sumit Semwal wrote:
From: Rob Clark robdclark@gmail.com
For devices which have constraints about maximum number of segments in an sglist. For example, a device which could only deal with contiguous buffers would set max_segment_count to 1.
The initial motivation is for devices sharing buffers via dma-buf, to allow the buffer exporter to know the constraints of other devices which have attached to the buffer. The dma_mask and fields in 'struct device_dma_parameters' tell the exporter everything else that is needed, except whether the importer has constraints about maximum number of segments.
Signed-off-by: Rob Clark robdclark@gmail.com [sumits: Minor updates wrt comments] Signed-off-by: Sumit Semwal sumit.semwal@linaro.org
v3: include Robin Murphy's fix[1] for handling '0' as a value for max_segment_count v2: minor updates wrt comments on the first version
include/linux/device.h | 1 + include/linux/dma-mapping.h | 19 +++++++++++++++++++ 2 files changed, 20 insertions(+)
diff --git a/include/linux/device.h b/include/linux/device.h index fb506738f7b7..a32f9b67315c 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -647,6 +647,7 @@ struct device_dma_parameters { * sg limitations. */ unsigned int max_segment_size;
- unsigned int max_segment_count; /* INT_MAX for unlimited */ unsigned long segment_boundary_mask;
}; diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h index c3007cb4bfa6..d3351a36d5ec 100644 --- a/include/linux/dma-mapping.h +++ b/include/linux/dma-mapping.h @@ -154,6 +154,25 @@ static inline unsigned int dma_set_max_seg_size(struct device *dev, return -EIO; } +#define DMA_SEGMENTS_MAX_SEG_COUNT ((unsigned int) INT_MAX)
+static inline unsigned int dma_get_max_seg_count(struct device *dev) +{
- if (dev->dma_parms && dev->dma_parms->max_segment_count)
return dev->dma_parms->max_segment_count;
- return DMA_SEGMENTS_MAX_SEG_COUNT;
+}
+static inline int dma_set_max_seg_count(struct device *dev,
unsigned int count)
+{
- if (dev->dma_parms) {
dev->dma_parms->max_segment_count = count;
return 0;
- }
- return -EIO;
+}
static inline unsigned long dma_get_seg_boundary(struct device *dev) { return dev->dma_parms ?
linaro-kernel@lists.linaro.org