Hi
I'd like to submit for consideration and discussion the following proposed backend API design to address some of the current limitations regarding excessive mem copies and sub-optimal memory behavior in Arm NN. This design also lays the foundation for future roadmap items to address protected content and affects backend authors only.
One open question on which I would like feedback is "how important are backward compatibility and stability of this backend API?". I believe it should be possible to keep existing backends working, though it would be far simpler from an implementation and testing perspective if we could implement this in an API-breaking way. Of course, if this is unacceptable to the community, we will endeavor to maintain the current API (though deprecated) alongside the new API for at least one release cycle. As the API matures, I expect these types of intrusive changes to become far less common.
...
So why change the API?
The current design requires that all tensors are allocated by the backend which executes the workload. The workload inputs and outputs are allocated by the backend via the workload factory interface. In order for inter-backend compatibility to work, all TensorHandles are required to implement Map/UnMap methods which expose the raw CPU-accessible pointer. A standard mem copy is then applied to copy the data from one tensor type to another using these mapped tensors. This copy is performed even in situations where different backends could potentially use the same TensorHandle type, making the mem copy redundant. The current mechanism is not sufficient to cover the multiple types of heaps that may be available on a system, or the different usage patterns required for optimal performance.
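In essence, every inter-backend transition today reduces to something like the following sketch (the helper and exact signatures are illustrative, not the actual ArmNN code; it assumes a Map(bool blocking) returning a CPU pointer, as the current ITensorHandle interface provides):

// Sketch of the current "max compatibility" path: map both handles,
// then perform a plain CPU mem copy between them.
void CopyTensorContents(ITensorHandle& src, ITensorHandle& dst, size_t numBytes)
{
    const void* srcPtr = src.Map(true);  // blocking map of the source
    void* dstPtr = dst.Map(true);        // blocking map of the destination
    memcpy(dstPtr, srcPtr, numBytes);
    dst.Unmap();
    src.Unmap();
}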
What follows is a design which should enable the ArmNN framework to minimize the number of mem copies required when transitioning between different backends, and allow backends to use their optimal heaps, while maintaining compatibility and correct functionality.
Design
There are two aspects to this design:
1. a mechanism to query tensor compatibility between backends
2. a mechanism to select and allocate the best compatible tensor type
TensorHandle Factory
This design introduces a new interface class ITensorHandleFactory which exposes the following methods:
virtual std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& parent,
                                                             TensorShape const& subTensorShape,
                                                             unsigned int const* subTensorOrigin) const = 0;
virtual std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const = 0;
virtual const FactoryId GetId() const = 0;
These methods are currently located on the IWorkloadFactory interface. By moving this interface onto a new dedicated class, it becomes possible for backends to implement multiple factories, each with different TensorHandle properties.
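As a purely illustrative sketch (the "Example" names are hypothetical and FactoryId is assumed to be a string type; this is not part of the proposal itself), a backend-specific factory might look like:

// Hypothetical device-local tensor factory for a discrete GPU backend.
class ExampleGpuTensorHandleFactory : public ITensorHandleFactory
{
public:
    static const FactoryId& GetIdStatic()
    {
        // Globally unique id of the form "VendorId/BackendName/FactoryName" (see below).
        static const FactoryId s_Id = "ExampleVendor/ExampleGpu/DeviceLocal";
        return s_Id;
    }

    std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const override
    {
        // ExampleGpuTensorHandle is a hypothetical ITensorHandle backed by
        // non-mappable, device-local memory.
        return std::make_unique<ExampleGpuTensorHandle>(tensorInfo);
    }

    std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& /*parent*/,
                                                         TensorShape const& /*subTensorShape*/,
                                                         unsigned int const* /*subTensorOrigin*/) const override
    {
        return nullptr; // this hypothetical factory does not support sub-tensors
    }

    const FactoryId GetId() const override { return GetIdStatic(); }
};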
FactoryId
Each TensorHandleFactory has a globally unique identifier string. This should take the form of "VendorId/BackendName/FactoryName".
Multiple factories
It should be possible for a backend to support multiple TensorHandle types, each with different access properties. For example, a discrete GPU might have GPU memory tensors (which are not mappable but provide fast read/write access for the GPU) and staging tensors (which are mappable but slower to access). In this scenario, the framework should use the GPU tensors between workloads which execute on the GPU, and staging tensors for transitions between the GPU and another backend.
Another scenario where this would be useful is for vendors with proprietary formats/compression/layout where these tensors would not be compatible with other backends. The current design cannot support these easily.
TensorHandleFactoryRegistry
Each backend will register its TensorHandleFactory objects, as well as any IMemoryManager objects it might require. There is a new method on the IBackendInternal interface which backend authors need to implement.
virtual void RegisterTensorHandleFactories(class TensorHandleFactoryRegistry& registry) {}
The implementation of this method needs to create the concrete factory and memory manager instances and register them via the following methods on the TensorHandleFactoryRegistry parameter object.
void RegisterFactory(std::unique_ptr<ITensorHandleFactory> factory);
void RegisterMemoryManager(std::weak_ptr<IMemoryManager> memoryManager);
Note: The registry currently takes ownership of the factories but keeps only a weak ptr to the memory manager. The exact details of this interface, particularly regarding ownership, are not final and could change.
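To make the registration flow concrete, a sketch of what a backend's implementation might look like (reusing the hypothetical "Example" types from above; ownership details may change as noted):

void ExampleGpuBackend::RegisterTensorHandleFactories(TensorHandleFactoryRegistry& registry)
{
    // The registry keeps only a weak_ptr, so the backend keeps the memory
    // manager alive through a shared_ptr member.
    m_MemoryManager = std::make_shared<ExampleMemoryManager>();
    registry.RegisterMemoryManager(m_MemoryManager);

    // Ownership of the factories passes to the registry.
    registry.RegisterFactory(std::make_unique<ExampleGpuTensorHandleFactory>());
    registry.RegisterFactory(std::make_unique<ExampleStagingTensorHandleFactory>());
}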
TensorHandleFactory preferences
In some scenarios, such as on a system with a Unified Memory Architecture and compatible APIs, it might be possible for two different backends to access tensors of the same TensorHandle type. For example, the CpuAcc (Neon) backend can work just as well using tensors allocated by the GpuAcc (CL) backend. In order to support this in a generic way, each backend will be able to report a list of known TensorHandleFactory instances with which it is compatible. To support this, the following method is added to the IBackendInternal interface.
virtual std::vector<ITensorHandleFactory::FactoryId> GetHandleFactoryPreferences() const = 0;
This method should return, in preference order, the FactoryId of any factories (including its own) with which the backend is compatible. The ranking runs from highest performance to highest compatibility.
In the discrete GPU example, the GPU-only tensor factory would be first on the list and the tensor factory which supports Map/Unmap would be second.
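A sketch for that discrete GPU backend (again using the hypothetical "Example" factories):

std::vector<ITensorHandleFactory::FactoryId> ExampleGpuBackend::GetHandleFactoryPreferences() const
{
    return
    {
        ExampleGpuTensorHandleFactory::GetIdStatic(),     // fastest: device-local, not mappable
        ExampleStagingTensorHandleFactory::GetIdStatic()  // most compatible: mappable staging memory
    };
}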
TensorHandleFactory properties
There will be additional methods on this ITensorHandleFactory interface to query the properties of the TensorHandles allocated by the factory (exact API TBD). These properties will be queried by the Optimizer when coming up with a tensor handle strategy for "optimal performance".
Some example properties might be:
- SupportsSubTensors - Equivalent to the existing functionality on the IWorkloadFactory.
- SupportsMapUnmap - Map/Unmap support is currently required; however, this will likely become optional in the future.
- SupportsMemoryImport - The mem copy of inputs could be removed in scenarios where TensorHandles can import externally allocated memory.
- SupportsMemoryExport - The mem copy between different backends could be removed in scenarios where the two backends support memory export and memory import respectively.
The framework will use these properties to determine the best strategy for allocation (i.e. which factory to use, or when to insert mem copies) and to identify unsupported/invalid scenarios (i.e. no compatible factories found).
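Since the exact API is TBD, the following is only one possible shape for these queries, sketched as default-implemented virtuals added to ITensorHandleFactory:

// Illustrative capability queries; names mirror the properties listed above.
virtual bool SupportsSubTensors()   const { return false; }
virtual bool SupportsMapUnmap()     const { return true; }  // required today, likely optional later
virtual bool SupportsMemoryImport() const { return false; }
virtual bool SupportsMemoryExport() const { return false; }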
MemoryTypes
For memory import and export scenarios, we will limit this to CPU-addressable memory for this initial implementation. In the future we can add support for import from dma_buf or ION buffers, and even protected dma_buf.
...
I hope you'll agree that this design opens a lot of potential for improved flexibility and performance. I look forward to further discussions on this subject.
Kind regards,
Derek
Hi Derek,
This topic is really interesting, and I have some comments inline. Please correct me if I'm wrong.
On Sat, 8 Jun 2019 at 00:59, Derek Lamberti derek.lamberti@linaro.org wrote:
Hi
I'd like to submit for consideration and discussion the following proposed backend API design to address some of the current limitations regarding excessive mem copies and sub-optimal memory behavior in Arm NN. This design also lays the foundation for future roadmap items to address protected content and affects backend authors only.
One open question on which I would like feedback is "how important are backward compatibility and stability of this backend API?". I believe it should be possible to keep existing backends working, though it would be far simpler from an implementation and testing perspective if we could implement this in an API-breaking way. Of course, if this is unacceptable to the community, we will endeavor to maintain the current API (though deprecated) alongside the new API for at least one release cycle. As the API matures, I expect these types of intrusive changes to become far less common.
At the current stage, IMHO it is more important to agree on the backend API itself with the vendors working on their backends than to consider compatibility, which can be the next step.
...
So why change the API?
The current design requires that all tensors are allocated by the backend which executes the workload. The workload inputs and outputs are allocated by the backend via the workload factory interface. In order for inter-backend compatibility to work, all TensorHandles are required to implement Map/UnMap methods which expose the raw CPU-accessible pointer. A standard mem copy is then applied to copy the data from one tensor type to another using these mapped tensors. This copy is performed even in situations where different backends could potentially use the same TensorHandle type, making the mem copy redundant. The current mechanism is not sufficient to cover the multiple types of heaps that may be available on a system, or the different usage patterns required for optimal performance.
Shall we generalize the memory copy interface as well? Currently memcpy is used by default with CPU operations, but I think a DMA copy can be more efficient sometimes (especially for local device memory).
What follows is a design which should enable the ArmNN framework to minimize the number of mem copies required when transitioning between different backends, and allow backends to use their optimal heaps, while maintaining compatibility and correct functionality.
Design
There are two aspects to this design:
1. a mechanism to query tensor compatibility between backends
2. a mechanism to select and allocate the best compatible tensor type
TensorHandle Factory
This design introduces a new interface class ITensorHandleFactory which exposes the following methods:
virtual std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& parent,
                                                             TensorShape const& subTensorShape,
                                                             unsigned int const* subTensorOrigin) const = 0;
virtual std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const = 0;
virtual const FactoryId GetId() const = 0;
These methods are currently located on the IWorkloadFactory interface. By moving this interface onto a new dedicated class, it becomes possible for backends to implement multiple factories, each with different TensorHandle properties.
FactoryId
Each TensorHandleFactory has a globally unique identifier string. This should take the form of "VendorId/BackendName/FactoryName".
Multiple factories
It should be possible for a backend to support multiple TensorHandle types, each with different access properties. For example, a discrete GPU might have GPU memory tensors (which are not mappable but provide fast read/write access for the GPU) and staging tensors (which are mappable but slower to access). In this scenario, the framework should use the GPU tensors between workloads which execute on the GPU, and staging tensors for transitions between the GPU and another backend.
Another scenario where this would be useful is for vendors with proprietary formats/compression/layout where these tensors would not be compatible with other backends. The current design cannot support these easily.
How can it be supported with the new design? I would assume some additional interfaces will be required to support this kind of compatibility.
TensorHandleFactoryRegistry
Each backend will register its TensorHandleFactory objects, as well as any IMemoryManager objects it might require. There is a new method on the IBackendInternal interface which backend authors need to implement.
virtual void RegisterTensorHandleFactories(class TensorHandleFactoryRegistry& registry) {}
The implementation of this method needs to create the concrete factory and memory manager instances and register them via the following methods on the TensorHandleFactoryRegistry parameter object.
void RegisterFactory(std::unique_ptr<ITensorHandleFactory> factory);
void RegisterMemoryManager(std::weak_ptr<IMemoryManager> memoryManager);
Note: The registry currently takes ownership of the factories but keeps only a weak ptr to the memory manager. The exact details of this interface, particularly regarding ownership, are not final and could change.
TensorHandleFactory preferences
In some scenarios, such as on a system with a Unified Memory Architecture and compatible APIs, it might be possible for two different backends to access tensors of the same TensorHandle type. For example, the CpuAcc (Neon) backend can work just as well using tensors allocated by the GpuAcc (CL) backend. In order to support this in a generic way, each backend will be able to report a list of known TensorHandleFactory instances with which it is compatible. To support this, the following method is added to the IBackendInternal interface.
virtual std::vector<ITensorHandleFactory::FactoryId> GetHandleFactoryPreferences() const = 0;
This method should return, in preference order, the FactoryId of any factories (including its own) with which the backend is compatible. The ranking runs from highest performance to highest compatibility.
In the discrete GPU example, the GPU-only tensor factory would be first on the list and the tensor factory which supports Map/Unmap would be second.
TensorHandleFactory properties
There will be additional methods on this ITensorHandleFactory interface to query the properties of the TensorHandles allocated by the factory (exact API TBD). These properties will be queried by the Optimizer when coming up with a tensor handle strategy for "optimal performance".
Some example properties might be:
- SupportsSubTensors - Equivalent to the existing functionality on the IWorkloadFactory.
- SupportsMapUnmap - Map/Unmap support is currently required; however, this will likely become optional in the future.
- SupportsMemoryImport - The mem copy of inputs could be removed in scenarios where TensorHandles can import externally allocated memory.
- SupportsMemoryExport - The mem copy between different backends could be removed in scenarios where the two backends support memory export and memory import respectively.
The framework will use these properties to determine the best strategy for allocation (i.e. which factory to use, or when to insert mem copies) and to identify unsupported/invalid scenarios (i.e. no compatible factories found).
MemoryTypes
For memory import and export scenarios, we will limit this to CPU-addressable memory for this initial implementation. In the future we can add support for import from dma_buf or ION buffers, and even protected dma_buf.
Shall we have the interface for import/export defined first? We can use CPU-addressable memory for the initial implementation, and developers in the community who are interested may experiment with dmabuf/ION via the new interface as well.
...
I hope you'll agree that this design opens a lot of potential for improved flexibility and performance. I look forward to further discussions on this subject.
Kind regards,
Derek
Hi Jammy,
Please see my inline response below.
On Mon, 17 Jun 2019 at 03:08, Jammy Zhou jammy.zhou@linaro.org wrote:
Hi Derek,
This topic is really interesting, and I have some comments inline. Please correct me if I'm wrong.
On Sat, 8 Jun 2019 at 00:59, Derek Lamberti derek.lamberti@linaro.org wrote:
Hi
I'd like to submit for consideration and discussion the following proposed backend API design to address some of the current limitations regarding excessive mem copies and sub-optimal memory behavior in Arm NN. This design also lays the foundation for future roadmap items to address protected content and affects backend authors only.
One open question on which I would like feedback is "how important are backward compatibility and stability of this backend API?". I believe it should be possible to keep existing backends working, though it would be far simpler from an implementation and testing perspective if we could implement this in an API-breaking way. Of course, if this is unacceptable to the community, we will endeavor to maintain the current API (though deprecated) alongside the new API for at least one release cycle. As the API matures, I expect these types of intrusive changes to become far less common.
At the current stage, IMHO it is more important to agree on the backend API itself with the vendors working on their backends than to consider compatibility, which can be the next step.
...
So why change the API?
The current design requires that all tensors are allocated by the backend which executes the workload. The workload inputs and outputs are allocated by the backend via the workload factory interface. In order for inter-backend compatibility to work, all TensorHandles are required to implement Map/UnMap methods which expose the raw CPU-accessible pointer. A standard mem copy is then applied to copy the data from one tensor type to another using these mapped tensors. This copy is performed even in situations where different backends could potentially use the same TensorHandle type, making the mem copy redundant. The current mechanism is not sufficient to cover the multiple types of heaps that may be available on a system, or the different usage patterns required for optimal performance.
Shall we generalize the memory copy interface as well? Currently memcpy is used by default with CPU operations, but I think a DMA copy can be more efficient sometimes (especially for local device memory).
In principle I agree. A DMA option would be very useful and is something we should look to add as one of the strategies for moving data around. I think the current memcopy workloads should remain unchanged, as they are there to provide the "max compatibility" option (though eventually they would not be created via workload factories but rather as an ArmNN core workload). However, for backends which do support some form of DMA, there should be a mechanism to accelerate that. Currently I'm thinking of introducing two new workload types: DmaToWorkload (which reads from a mappable ITensorHandle) and DmaFromWorkload (which writes to a mappable ITensorHandle). ** exact naming TBD. These workloads would be created by the workload factory belonging to the backend which provides the factories for the From and To tensors. The ArmNN framework would handle this via one of the "Memory Strategy" options selected during optimization. The assumption here is that the DMA workloads could read/write directly to/from "proprietary" tensor handle types, i.e. ones with a custom internal data layout/compression format.
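To sketch the shape of this (naming TBD as noted, and assuming an IWorkload base whose Execute() performs the work; none of this is final):

// Hypothetical transfer workload: proprietary-layout source -> mappable destination.
class ExampleDmaFromWorkload : public IWorkload
{
public:
    ExampleDmaFromWorkload(ITensorHandle& src, ITensorHandle& dst)
        : m_Src(src), m_Dst(dst) {}

    void Execute() const override
    {
        // Drive the device DMA engine to write m_Src (custom layout, possibly
        // compressed) into m_Dst in a format known to ArmNN, then synchronize.
    }

private:
    ITensorHandle& m_Src; // proprietary, non-mappable tensor
    ITensorHandle& m_Dst; // mappable tensor
};
// The DmaToWorkload would be the mirror image (mappable source -> proprietary destination).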
What follows is a design which should enable the ArmNN framework to minimize the number of mem copies required when transitioning between different backends, and allow backends to use their optimal heaps, while maintaining compatibility and correct functionality.
Design
There are two aspects to this design:
1. a mechanism to query tensor compatibility between backends
2. a mechanism to select and allocate the best compatible tensor type
TensorHandle Factory
This design introduces a new interface class ITensorHandleFactory which exposes the following methods:
virtual std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& parent,
                                                             TensorShape const& subTensorShape,
                                                             unsigned int const* subTensorOrigin) const = 0;
virtual std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const = 0;
virtual const FactoryId GetId() const = 0;
These methods are currently located on the IWorkloadFactory interface. By moving this interface onto a new dedicated class, it becomes possible for backends to implement multiple factories, each with different TensorHandle properties.
FactoryId
Each TensorHandleFactory has a globally unique identifier string. This should take the form of "VendorId/BackendName/FactoryName".
Multiple factories
It should be possible for a backend to support multiple TensorHandle types, each with different access properties. For example, a discrete GPU might have GPU memory tensors (which are not mappable but provide fast read/write access for the GPU) and staging tensors (which are mappable but slower to access). In this scenario, the framework should use the GPU tensors between workloads which execute on the GPU, and staging tensors for transitions between the GPU and another backend.
Another scenario where this would be useful is for vendors with proprietary formats/compression/layout where these tensors would not be compatible with other backends. The current design cannot support these easily.
How can it be supported with the new design? I would assume some additional interfaces will be required to support this kind of compatibility.
In the proposed design, this would be supported as follows:
1. The Map/Unmap interface becomes optional. For tensor handle factories with proprietary data layouts/compression, Map/Unmap is reported as NOT supported. These kinds of tensor factories would then only be used by the framework when going between workloads from the same backend, or when the factories are directly compatible (i.e. in the preferred tensor handle factory list on both backends).
2. The backend reports these "proprietary" tensor handle types as its most preferred tensors. The ArmNN framework will simply not select these tensor handle factories in cases where it determines that it needs Map/Unmap or Import/Export support.
3. The key is that data exposed by Map/Unmap or Import/Export needs to be in a format known to ArmNN and other backends.
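As an illustration of the resulting selection logic (a sketch only; GetFactory() is an assumed lookup helper on the registry, and the real optimizer would weigh more than these two conditions):

// For an edge crossing from 'producer' to 'consumer', pick the producer's most
// preferred factory that either the consumer also lists (directly compatible)
// or that supports Map/Unmap (so a copy workload can bridge the gap).
ITensorHandleFactory::FactoryId ChooseFactoryForEdge(const IBackendInternal& producer,
                                                     const IBackendInternal& consumer,
                                                     TensorHandleFactoryRegistry& registry)
{
    auto consumerPrefs = consumer.GetHandleFactoryPreferences();
    for (const auto& id : producer.GetHandleFactoryPreferences())
    {
        ITensorHandleFactory* factory = registry.GetFactory(id); // assumed helper
        if (factory == nullptr)
        {
            continue;
        }
        bool directlyCompatible =
            std::find(consumerPrefs.begin(), consumerPrefs.end(), id) != consumerPrefs.end();
        if (directlyCompatible || factory->SupportsMapUnmap())
        {
            return id;
        }
    }
    throw InvalidArgumentException("No compatible tensor handle factory found");
}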
TensorHandleFactoryRegistry
Each backend will register its TensorHandleFactory objects, as well as any IMemoryManager objects it might require. There is a new method on the IBackendInternal interface which backend authors need to implement.
virtual void RegisterTensorHandleFactories(class TensorHandleFactoryRegistry& registry) {}
The implementation of this method needs to create the concrete factory and memory manager instances and register them via the following methods on the TensorHandleFactoryRegistry parameter object.
void RegisterFactory(std::unique_ptr<ITensorHandleFactory> factory);
void RegisterMemoryManager(std::weak_ptr<IMemoryManager> memoryManager);
Note: The registry currently takes ownership of the factories but keeps only a weak ptr to the memory manager. The exact details of this interface, particularly regarding ownership, are not final and could change.
TensorHandleFactory preferences
In some scenarios, such as on a system with a Unified Memory Architecture and compatible APIs, it might be possible for two different backends to access tensors of the same TensorHandle type. For example, the CpuAcc (Neon) backend can work just as well using tensors allocated by the GpuAcc (CL) backend. In order to support this in a generic way, each backend will be able to report a list of known TensorHandleFactory instances with which it is compatible. To support this, the following method is added to the IBackendInternal interface.
virtual std::vector<ITensorHandleFactory::FactoryId> GetHandleFactoryPreferences() const = 0;
This method should return, in preference order, the FactoryId of any factories (including its own) with which the backend is compatible. The ranking runs from highest performance to highest compatibility.
In the discrete GPU example, the GPU-only tensor factory would be first on the list and the tensor factory which supports Map/Unmap would be second.
TensorHandleFactory properties
There will be additional methods on this ITensorHandleFactory interface to query the properties of the TensorHandles allocated by the factory (exact API TBD). These properties will be queried by the Optimizer when coming up with a tensor handle strategy for "optimal performance".
Some example properties might be:
- SupportsSubTensors - Equivalent to the existing functionality on the IWorkloadFactory.
- SupportsMapUnmap - Map/Unmap support is currently required; however, this will likely become optional in the future.
- SupportsMemoryImport - The mem copy of inputs could be removed in scenarios where TensorHandles can import externally allocated memory.
- SupportsMemoryExport - The mem copy between different backends could be removed in scenarios where the two backends support memory export and memory import respectively.
The framework will use these properties to determine the best strategy for allocation (i.e. which factory to use, or when to insert mem copies) and to identify unsupported/invalid scenarios (i.e. no compatible factories found).
MemoryTypes
For memory import and export scenarios, we will limit this to CPU-addressable memory for this initial implementation. In the future we can add support for import from dma_buf or ION buffers, and even protected dma_buf.
Shall we have the interface for import/export defined first? We can use CPU-addressable memory for the initial implementation, and developers in the community who are interested may experiment with dmabuf/ION via the new interface as well.
I would like to treat the import/export functionality as an extension of the TensorHandle API rather than as something we need to have in detail in the initial design. So long as the Tensor Handle Factory API design is, in principle, fit for purpose with all our memory import/export requirements, the exact API is something we can determine later. To be clear, I realize import/export is the "headline feature" partners are looking forward to, so I understand and agree that it is important. I simply would like to make sure that we get a solid foundation first, rather than jumping straight to the end goal and risking ending up with something that doesn't interact well with the rest of the system.
That being said, I'm happy to continue discussing import/export requirements in this thread to make sure we do consider the viability of this design, even if they're not directly addressed by the initial implementation.
On that note, I expect we will want to ensure suitability for working with CPU memory, DMA_BUF (protected and unprotected), and ION buffers. So would an import API which takes a void* and an enumeration/flag describing the source (CPU, DMA, ION, etc.) be sufficient? This would also mean the Tensor Handle Factory interface for querying Import/Export support needs to report explicitly which sources it supports. Also, it should be possible to support import from one or more sources while not supporting export, so the interface must not conflate the two. Is this a sufficient description of the requirements, and does void* + source flags provide enough information to work for our use cases?
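To make the question concrete, something along these lines (all names illustrative, not a final API):

// Source of externally allocated memory.
enum class MemorySource
{
    CpuMalloc,
    DmaBuf,
    DmaBufProtected,
    Ion
};

// On ITensorHandle: returns false if this handle cannot import from 'source'.
virtual bool Import(void* memory, MemorySource source) { return false; }

// On ITensorHandleFactory: capabilities reported per source, with import and
// export queried independently so the two are never conflated.
virtual bool SupportsMemoryImport(MemorySource source) const { return false; }
virtual bool SupportsMemoryExport(MemorySource source) const { return false; }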
...
I hope you'll agree that this design opens a lot of potential for improved flexibility and performance. I look forward to further discussions on this subject.
Kind regards,
Derek
Some minor comments inline
On Thu, 20 Jun 2019 at 19:03, Derek Lamberti derek.lamberti@linaro.org wrote:
Hi Jammy,
Please see my inline response below.
On Mon, 17 Jun 2019 at 03:08, Jammy Zhou jammy.zhou@linaro.org wrote:
Hi Derek,
This topic is really interesting, and I have some comments inline.
Please correct me if I'm wrong.
On Sat, 8 Jun 2019 at 00:59, Derek Lamberti derek.lamberti@linaro.org wrote:
Hi
I'd like to submit for consideration and discussion the following proposed backend API design to address some of the current limitations regarding excessive mem copies and sub-optimal memory behavior in Arm NN. This design also lays the foundation for future roadmap items to address protected content and affects backend authors only.
One open question on which I would like feedback is "how important are backward compatibility and stability of this backend API?". I believe it should be possible to keep existing backends working, though it would be far simpler from an implementation and testing perspective if we could implement this in an API-breaking way. Of course, if this is unacceptable to the community, we will endeavor to maintain the current API (though deprecated) alongside the new API for at least one release cycle. As the API matures, I expect these types of intrusive changes to become far less common.
At the current stage, IMHO it is more important to agree on the backend API itself with the vendors working on their backends than to consider compatibility, which can be the next step.
...
So why change the API?
The current design requires that all tensors are allocated by the backend which executes the workload. The workload inputs and outputs are allocated by the backend via the workload factory interface. In order for inter-backend compatibility to work, all TensorHandles are required to implement Map/UnMap methods which expose the raw CPU-accessible pointer. A standard mem copy is then applied to copy the data from one tensor type to another using these mapped tensors. This copy is performed even in situations where different backends could potentially use the same TensorHandle type, making the mem copy redundant. The current mechanism is not sufficient to cover the multiple types of heaps that may be available on a system, or the different usage patterns required for optimal performance.
Shall we generalize the memory copy interface as well? Currently memcpy is used by default with CPU operations, but I think a DMA copy can be more efficient sometimes (especially for local device memory).
In principle I agree. A DMA option would be very useful and is something we should look to add as one of the strategies for moving data around. I think the current memcopy workloads should remain unchanged, as they are there to provide the "max compatibility" option (though eventually they would not be created via workload factories but rather as an ArmNN core workload). However, for backends which do support some form of DMA, there should be a mechanism to accelerate that. Currently I'm thinking of introducing two new workload types: DmaToWorkload (which reads from a mappable ITensorHandle) and DmaFromWorkload (which writes to a mappable ITensorHandle). ** exact naming TBD. These workloads would be created by the workload factory belonging to the backend which provides the factories for the From and To tensors. The ArmNN framework would handle this via one of the "Memory Strategy" options selected during optimization. The assumption here is that the DMA workloads could read/write directly to/from "proprietary" tensor handle types, i.e. ones with a custom internal data layout/compression format.
Looks good to me.
What follows is a design which should enable the ArmNN framework to minimize the number of mem copies required when transitioning between different backends, and allow backends to use their optimal heaps, while maintaining compatibility and correct functionality.
Design
There are two aspects to this design:
1. a mechanism to query tensor compatibility between backends
2. a mechanism to select and allocate the best compatible tensor type
TensorHandle Factory
This design introduces a new interface class ITensorHandleFactory which exposes the following methods:
virtual std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& parent,
                                                             TensorShape const& subTensorShape,
                                                             unsigned int const* subTensorOrigin) const = 0;
virtual std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const = 0;
virtual const FactoryId GetId() const = 0;
These methods are currently located on the IWorkloadFactory interface. By moving this interface onto a new dedicated class, it becomes possible for backends to implement multiple factories, each with different TensorHandle properties.
FactoryId
Each TensorHandleFactory has a globally unique identifier string. This should take the form of "VendorId/BackendName/FactoryName".
Multiple factories
It should be possible for a backend to support multiple TensorHandle types, each with different access properties. For example, a discrete GPU might have GPU memory tensors (which are not mappable but provide fast read/write access for the GPU) and staging tensors (which are mappable but slower to access). In this scenario, the framework should use the GPU tensors between workloads which execute on the GPU, and staging tensors for transitions between the GPU and another backend.
Another scenario where this would be useful is for vendors with proprietary formats/compression/layout where these tensors would not be compatible with other backends. The current design cannot support these easily.
How can it be supported with the new design? I would assume some additional interfaces will be required to support this kind of compatibility.
In the proposed design, this would be supported as follows:
1. The Map/Unmap interface becomes optional. For tensor handle factories with proprietary data layouts/compression, Map/Unmap is reported as NOT supported. These kinds of tensor factories would then only be used by the framework when going between workloads from the same backend, or when the factories are directly compatible (i.e. in the preferred tensor handle factory list on both backends).
2. The backend reports these "proprietary" tensor handle types as its most preferred tensors. The ArmNN framework will simply not select these tensor handle factories in cases where it determines that it needs Map/Unmap or Import/Export support.
3. The key is that data exposed by Map/Unmap or Import/Export needs to be in a format known to ArmNN and other backends.
I'm thinking, for the proprietary data layouts/compression, shall we provide some additional backend interface to transform them to some 'known' format?
TensorHandleFactoryRegistry
Each backend will register its TensorHandleFactory objects, as well as any IMemoryManager objects it might require. There is a new method on the IBackendInternal interface which backend authors need to implement.
virtual void RegisterTensorHandleFactories(class TensorHandleFactoryRegistry& registry) {}
The implementation of this method needs to create the concrete factory and memory manager instances and register them via the following methods on the TensorHandleFactoryRegistry parameter object.
void RegisterFactory(std::unique_ptr<ITensorHandleFactory> factory);
void RegisterMemoryManager(std::weak_ptr<IMemoryManager> memoryManager);
Note: The registry currently takes ownership of the factories but keeps only a weak ptr to the memory manager. The exact details of this interface, particularly regarding ownership, are not final and could change.
TensorHandleFactory preferences
In some scenarios, such as on a system with a Unified Memory Architecture and compatible APIs, it might be possible for two different backends to access tensors of the same TensorHandle type. For example, the CpuAcc (Neon) backend can work just as well using tensors allocated by the GpuAcc (CL) backend. In order to support this in a generic way, each backend will be able to report a list of known TensorHandleFactory instances with which it is compatible. To support this, the following method is added to the IBackendInternal interface.
virtual std::vector<ITensorHandleFactory::FactoryId> GetHandleFactoryPreferences() const = 0;
This method should return, in preference order, the FactoryId of any factories (including its own) with which the backend is compatible. The ranking runs from highest performance to highest compatibility.
In the discrete GPU example, the GPU-only tensor factory would be first on the list and the tensor factory which supports Map/Unmap would be second.
TensorHandleFactory properties
There will be additional methods on this ITensorHandleFactory interface to query the properties of the TensorHandles allocated by the factory (exact API TBD). These properties will be queried by the Optimizer when coming up with a tensor handle strategy for "optimal performance".
Some example properties might be:
- SupportsSubTensors - Equivalent to the existing functionality on the IWorkloadFactory.
- SupportsMapUnmap - Map/Unmap support is currently required; however, this will likely become optional in the future.
- SupportsMemoryImport - The mem copy of inputs could be removed in scenarios where TensorHandles can import externally allocated memory.
- SupportsMemoryExport - The mem copy between different backends could be removed in scenarios where the two backends support memory export and memory import respectively.
The framework will use these properties to determine the best strategy for allocation (i.e. which factory to use, or when to insert mem copies) and to identify unsupported/invalid scenarios (i.e. no compatible factories found).
MemoryTypes
For memory import and export scenarios, we will limit this to CPU-addressable memory for this initial implementation. In the future we can add support for import from dma_buf or ION buffers, and even protected dma_buf.
Shall we have the interface for import/export defined first? We can use CPU-addressable memory for the initial implementation, and developers in the community who are interested may experiment with dmabuf/ION via the new interface as well.
I would like to treat the import/export functionality as an extension of the TensorHandle API rather than as something we need to have in detail in the initial design. So long as the Tensor Handle Factory API design is, in principle, fit for purpose with all our memory import/export requirements, the exact API is something we can determine later. To be clear, I realize import/export is the "headline feature" partners are looking forward to, so I understand and agree that it is important. I simply would like to make sure that we get a solid foundation first, rather than jumping straight to the end goal and risking ending up with something that doesn't interact well with the rest of the system.
That's okay with me.
That being said, I'm happy to continue discussing import/export requirements in this thread to make sure we do consider the viability of this design, even if they're not directly addressed by the initial implementation.
On that note, I expect we will want to ensure suitability for working with CPU memory, DMA_BUF (protected and unprotected), and ION buffers. So would an import API which takes a void* and an enumeration/flag describing the source (CPU, DMA, ION, etc.) be sufficient? This would also mean the Tensor Handle Factory interface for querying Import/Export support needs to report explicitly which sources it supports. Also, it should be possible to support import from one or more sources while not supporting export, so the interface must not conflate the two. Is this a sufficient description of the requirements, and does void* + source flags provide enough information to work for our use cases?
I think if we want to leverage the dmabuf kernel subsystem, a dmabuf fd will be required as part of the parameters. drmPrimeHandleToFD/drmPrimeFDToHandle [1] can probably be referenced. And it should be similar for ION on Android.
[1] https://cgit.freedesktop.org/mesa/drm/tree/xf86drm.c
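For example (a sketch only, names illustrative), an fd-based entry point would be needed alongside the void* one, since dmabuf/ION buffers are identified by a file descriptor rather than a CPU pointer:

// Hypothetical overload for fd-backed sources such as dmabuf/ION.
virtual bool Import(int fd, MemorySource source) { return false; }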
...
I hope you'll agree that this design opens a lot of potential for improved flexibility and performance. I look forward to further discussions on this subject.
Kind regards,
Derek