Hi
I'd like to submit for consideration and discussion the following proposed backend API design to address some of the current limitations regarding excessive mem copies and sub-optimal memory behavior in Arm NN. This design also lays the foundation for future roadmap items to address protected content and affects backend authors only.
One open question on which I would like feedback is: "how important is backward compatibility and stability of this backend API?" I believe it should be possible to keep existing backends working, though it would be far simpler from an implementation and testing perspective if we could implement this in an API-breaking way. Of course, if this is unacceptable to the community, we will endeavor to maintain the current API (though deprecated) alongside the new API for at least one release cycle. As the API matures, I expect these types of intrusive changes to become far less common.
...
So why change the API?
The current design requires that all tensors are allocated by the backend which executes the workload. The workload inputs and outputs are allocated by the backend via the workload factory interface. For inter-backend compatibility to work, all TensorHandles are required to implement Map/UnMap methods which expose a raw CPU-accessible pointer. A standard mem copy is then applied to copy the data from one tensor type to another using these mapped tensors. This copy is performed even in situations where different backends could potentially use the same TensorHandle type, making the mem copy redundant. The current mechanism is also not sufficient to cover the multiple types of heaps that may be available on a system, or the different usage patterns required for optimal performance.
What follows is a design which should enable the ArmNN framework to minimize the number of mem copies required when transitioning between different backends while also allowing backends to use their optimal heaps while maintaining compatibility and correct functionality.
Design
There are two aspects to this design:
- a mechanism to query tensor compatibility between backends
- a mechanism to select and allocate the best compatible tensor type
TensorHandle Factory
This design introduces a new interface class ITensorHandleFactory which exposes the following methods:
virtual std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& parent, TensorShape const& subTensorShape, unsigned int const* subTensorOrigin) const = 0;
virtual std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const = 0;
virtual const FactoryId GetId() const = 0;
These methods are currently located on the IWorkloadFactory interface. By moving this interface onto a new dedicated class, it becomes possible for backends to implement multiple factories, each with different TensorHandle properties.
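To make the shape of this concrete, here is a minimal sketch of what a backend-specific factory could look like. The stand-in types, the RefTensorHandleFactory name, and the id string are illustrative assumptions, not the real Arm NN declarations:

```cpp
#include <cassert>
#include <memory>
#include <string>

// Simplified stand-ins for the real Arm NN types (assumptions for illustration).
struct TensorInfo {};
struct ITensorHandle { virtual ~ITensorHandle() = default; };
using FactoryId = std::string;

// The proposed interface, reduced to the two methods relevant here.
class ITensorHandleFactory
{
public:
    virtual ~ITensorHandleFactory() = default;
    virtual std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const = 0;
    virtual const FactoryId GetId() const = 0;
};

// A hypothetical backend-specific handle and its factory.
class RefTensorHandle : public ITensorHandle {};

class RefTensorHandleFactory : public ITensorHandleFactory
{
public:
    std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const override
    {
        return std::make_unique<RefTensorHandle>();
    }

    // Globally unique id in the proposed "VendorId/BackendName/FactoryName" form.
    const FactoryId GetId() const override { return "Arm/CpuRef/DefaultFactory"; }
};
```

A backend that supports several TensorHandle types would simply provide one such class per type, each with its own FactoryId.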
FactoryId
Each TensorHandleFactory has a globally unique identifier string. This should take the form of "VendorId/BackendName/FactoryName".
Multiple factories
It should be possible for a backend to support multiple TensorHandle types, each with different access properties. For example, a discrete GPU might have GPU memory tensors (which are not mappable but provide fast read/write access for the GPU) and staging tensors (which are mappable but slower to access). In this scenario, the framework should use the GPU tensors between workloads which execute on the GPU, and staging tensors for transitions between the GPU and another backend.
Another scenario where this would be useful is for vendors with proprietary formats/compression/layout where these tensors would not be compatible with other backends. The current design cannot support these easily.
TensorHandleFactoryRegistry
Each backend will register its TensorHandleFactory objects, as well as any IMemoryManager objects it might require. There is a new method on the IBackendInternal interface which backend authors need to implement.
virtual void RegisterTensorHandleFactories(class TensorHandleFactoryRegistry& registry) {}
The implementation of this method needs to create the concrete factory and memory manager instances and register them via the following methods on the TensorHandleFactoryRegistry parameter object.
void RegisterFactory(std::unique_ptr<ITensorHandleFactory> factory);
void RegisterMemoryManager(std::weak_ptr<IMemoryManager> memoryManager);
Note: the registry currently takes ownership of the factories but only keeps a weak_ptr to the memory manager. The exact details of this interface are not final and could change with regard to ownership.
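As a sketch of how a backend author might implement the registration hook, here is a simplified registry and a hypothetical Neon backend. The registry, memory manager, and backend classes below are illustrative stand-ins, but they mirror the proposed ownership model (the registry owns factories, and holds only a weak_ptr to memory managers):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-ins for the real Arm NN types (assumptions for illustration).
struct IMemoryManager { virtual ~IMemoryManager() = default; };
class ITensorHandleFactory
{
public:
    virtual ~ITensorHandleFactory() = default;
    virtual const std::string GetId() const = 0;
};

// Minimal registry: owns the factories, weakly references the memory managers.
class TensorHandleFactoryRegistry
{
public:
    void RegisterFactory(std::unique_ptr<ITensorHandleFactory> factory)
    {
        m_Factories.push_back(std::move(factory));
    }
    void RegisterMemoryManager(std::weak_ptr<IMemoryManager> memoryManager)
    {
        m_MemoryManagers.push_back(std::move(memoryManager));
    }
    size_t FactoryCount() const { return m_Factories.size(); }

private:
    std::vector<std::unique_ptr<ITensorHandleFactory>> m_Factories;
    std::vector<std::weak_ptr<IMemoryManager>> m_MemoryManagers;
};

// A hypothetical backend's implementation of the new virtual method.
class NeonMemoryManager : public IMemoryManager {};
class NeonTensorHandleFactory : public ITensorHandleFactory
{
public:
    const std::string GetId() const override { return "Arm/CpuAcc/DefaultFactory"; }
};

class NeonBackend
{
public:
    void RegisterTensorHandleFactories(TensorHandleFactoryRegistry& registry)
    {
        // The backend keeps shared ownership of its memory manager and hands
        // the registry only a weak_ptr, per the ownership note above.
        m_MemoryManager = std::make_shared<NeonMemoryManager>();
        registry.RegisterMemoryManager(m_MemoryManager);
        registry.RegisterFactory(std::make_unique<NeonTensorHandleFactory>());
    }

private:
    std::shared_ptr<NeonMemoryManager> m_MemoryManager;
};
```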
TensorHandleFactory preferences
In some scenarios, such as on a system with a Unified Memory Architecture and compatible APIs, it might be possible for two different backends to access tensors of the same TensorHandle type. For example, the CpuAcc (Neon) backend can work just as well using tensors allocated by the GpuAcc (CL) backend. In order to support this in a generic way, each backend will be able to report a list of known TensorHandleFactory instances with which it is compatible. To support this, the following method is added to the IBackendInternal interface.
virtual std::vector<ITensorHandleFactory::FactoryId> GetHandleFactoryPreferences() const = 0;
This method should return, in preference order, the FactoryIds of any factories (including its own) with which the backend is compatible, ranked from highest performance to highest compatibility.
In the discrete GPU example, the GPU-only tensor factory would be first on the list and the tensor factory which supports Map/Unmap would be second.
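The discrete GPU case could then be expressed as follows. This is a minimal sketch; the GpuBackend class and the factory id strings are hypothetical:

```cpp
#include <cassert>
#include <string>
#include <vector>

using FactoryId = std::string;

// Hypothetical discrete-GPU backend: fast device-local tensors first,
// mappable staging tensors second (ids are illustrative only).
class GpuBackend
{
public:
    std::vector<FactoryId> GetHandleFactoryPreferences() const
    {
        return { "Vendor/Gpu/DeviceLocalFactory", // highest performance, not mappable
                 "Vendor/Gpu/StagingFactory" };   // mappable, slower, most compatible
    }
};
```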
TensorHandleFactory properties
There will be additional methods on this ITensorHandleFactory interface to query the properties of the TensorHandles allocated by the factory (exact API TBD). These properties will be queried by the Optimizer when coming up with a tensor handle strategy for "optimal performance".
Some example properties might be:
SupportsSubTensors - equivalent to the existing functionality on the IWorkloadFactory.
SupportsMapUnmap - Map/Unmap support is currently required; however, this will likely become optional in the future.
SupportsMemoryImport - the mem copy of inputs could be removed for scenarios where TensorHandles can import externally allocated memory.
SupportsMemoryExport - the mem copy between different backends could be removed for scenarios where the two backends support memory export and memory import respectively.
The framework will use these properties to determine the best strategy for allocation (i.e. which factory to use, or when to insert mem copies) and to identify unsupported/invalid scenarios (i.e. no compatible factories found).
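One possible shape for that selection step is sketched below: walk the producing backend's ordered preference list and pick the first factory the consuming backend also accepts. This is an assumption about how the Optimizer might work, not the actual Arm NN implementation; the function name and the empty-string sentinel are illustrative:

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using FactoryId = std::string;

// Hypothetical sketch of the Optimizer's tensor handle strategy. Because the
// preference list is ordered from highest performance to highest
// compatibility, the first match is the best shared choice. An empty result
// means no shared factory exists, so the framework would fall back to
// inserting a mem copy between two Map/Unmap-capable tensors (or reject the
// graph if even that is unsupported).
FactoryId SelectHandleFactory(const std::vector<FactoryId>& producerPrefs,
                              const std::set<FactoryId>& consumerAccepts)
{
    for (const auto& id : producerPrefs)
    {
        if (consumerAccepts.count(id) > 0)
        {
            return id; // first match wins
        }
    }
    return ""; // no compatible factory: a mem copy boundary is required
}
```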
MemoryTypes
For memory import and export scenarios, we will limit this to CPU-addressable memory for this initial implementation. In the future we can add support for import from dma_buf or ion buffers, and even protected dma_buf.
...
I hope you'll agree that this design opens a lot of potential for improved flexibility and performance. I look forward to further discussions on this subject.
Kind regards,
Derek