Hi
I'd like to submit for consideration and discussion the following
proposed backend API design, which addresses some of the current
limitations around excessive memory copies and sub-optimal memory
behavior in Arm NN. This design also lays the foundation for future
roadmap items addressing protected content, and it affects backend
authors only.
One open question on which I would like feedback is: "how important are
backward compatibility and stability of this backend API?". I believe
it should be possible to keep existing backends working, though it
would be far simpler from an implementation and testing perspective if
we could implement this in an API-breaking way. Of course, if this is
unacceptable to the community, we will endeavor to maintain the
current API (though deprecated) alongside the new API for at least
one release cycle. As the API matures, I expect these types of
intrusive changes to become far less common.
...
So why change the API?
The current design requires that all tensors be allocated by the
backend which executes the workload: the workload's inputs and outputs
are allocated by that backend via the workload factory interface. For
inter-backend compatibility to work, all TensorHandles are required to
implement Map/Unmap methods which expose a raw, CPU-accessible pointer.
A standard memory copy is then used to move the data from one tensor
type to another via these mapped pointers. This copy is performed even
in situations where different backends could potentially use the same
TensorHandle type, making it redundant. The current mechanism is also
not sufficient to cover the multiple types of heap that may be
available on a system, or the different usage patterns required for
optimal performance.
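For illustration, here is a hedged sketch of the kind of copy the
framework performs today between two mappable handles. It assumes the
existing ITensorHandle Map(bool)/Unmap() signatures and tensors that
are contiguous, equally sized, and identically laid out; the real
framework uses its own copy helper.

#include <cstring>
// Assumes the ITensorHandle interface from the backend API headers
// (the exact header path varies between Arm NN versions).

void CopyBetweenBackends(armnn::ITensorHandle& src,
                         armnn::ITensorHandle& dst,
                         std::size_t numBytes)
{
    const void* srcPtr = src.Map(/*blocking=*/true); // raw CPU-accessible pointer
    void*       dstPtr = dst.Map(/*blocking=*/true);
    std::memcpy(dstPtr, srcPtr, numBytes);           // the copy this proposal aims to avoid
    dst.Unmap();
    src.Unmap();
}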
What follows is a design which should enable the Arm NN framework to
minimize the number of memory copies required when transitioning
between backends, while allowing each backend to use its optimal heaps
and maintaining compatibility and correct functionality.
Design
There are two aspects to this design:
1. a mechanism to query tensor compatibility between backends, and
2. a mechanism to select and allocate the best compatible tensor type.
TensorHandle Factory
This design introduces a new interface class ITensorHandleFactory
which exposes the following methods:
virtual std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& parent,
                                                             TensorShape const& subTensorShape,
                                                             unsigned int const* subTensorOrigin) const = 0;

virtual std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const = 0;

virtual const FactoryId GetId() const = 0;
These methods are currently located on the IWorkloadFactory interface.
By moving this interface onto a new dedicated class, it becomes
possible for backends to implement multiple factories, each with
different TensorHandle properties.
FactoryId
Each TensorHandleFactory has a globally unique identifier string. This
should take the form of "VendorId/BackendName/FactoryName".
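As a rough sketch, a concrete factory might look like the following.
All the Sample* names are hypothetical, and I am assuming FactoryId is
a string typedef on ITensorHandleFactory, as the GetId signature
suggests:

class SampleGpuTensorHandleFactory : public ITensorHandleFactory
{
public:
    static const FactoryId& GetIdStatic()
    {
        // Globally unique id in the "VendorId/BackendName/FactoryName" form.
        static const FactoryId s_Id("SampleVendor/SampleGpu/DeviceLocalFactory");
        return s_Id;
    }

    const FactoryId GetId() const override { return GetIdStatic(); }

    std::unique_ptr<ITensorHandle> CreateTensorHandle(const TensorInfo& tensorInfo) const override
    {
        // SampleGpuTensorHandle is a hypothetical device-local handle type.
        return std::make_unique<SampleGpuTensorHandle>(tensorInfo);
    }

    std::unique_ptr<ITensorHandle> CreateSubTensorHandle(ITensorHandle& parent,
                                                         TensorShape const& subTensorShape,
                                                         unsigned int const* subTensorOrigin) const override
    {
        return nullptr; // this particular factory does not support sub-tensors
    }
};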
Multiple factories
It should be possible for a backend to support multiple TensorHandle
types, each with different access properties. For example, a discrete
GPU might have GPU-memory tensors (which are not mappable but provide
fast read/write access for the GPU) and staging tensors (which are
mappable but slower to access). In this scenario, the framework should
use the GPU tensors between workloads which execute on the GPU, and
staging tensors for transitions between the GPU and another backend.
Another scenario where this would be useful is for vendors with
proprietary formats, compression, or layouts, whose tensors would not
be compatible with other backends. The current design cannot support
these scenarios easily.
TensorHandleFactoryRegistry
Each backend will register its TensorHandleFactory objects, as well as
any IMemoryManager objects it might require, with a central registry.
There is a new method on the IBackendInternal interface which backend
authors need to implement:

virtual void RegisterTensorHandleFactories(class TensorHandleFactoryRegistry& registry) {}

The implementation of this method needs to create the concrete factory
and memory manager instances and register them via the following
methods on the TensorHandleFactoryRegistry parameter object:
void RegisterFactory(std::unique_ptr<ITensorHandleFactory> factory);
void RegisterMemoryManager(std::weak_ptr<IMemoryManager> memoryManager);
Note: the registry currently takes ownership of the factories but keeps
only a weak pointer to the memory manager. The exact details of this
interface are not final and could change with regard to ownership.
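A hedged sketch of what a backend's implementation might look like,
given the current shape of the interface (SampleGpuBackend,
SampleMemoryManager and SampleGpuTensorHandleFactory are hypothetical):

void SampleGpuBackend::RegisterTensorHandleFactories(TensorHandleFactoryRegistry& registry)
{
    auto memoryManager = std::make_shared<SampleMemoryManager>();

    // The registry keeps only a weak_ptr, so the factory shares
    // ownership of the memory manager to keep it alive as long as needed.
    registry.RegisterMemoryManager(memoryManager);
    registry.RegisterFactory(std::make_unique<SampleGpuTensorHandleFactory>(memoryManager));
}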
TensorHandleFactory preferences
In some scenarios, such as on a system with a Unified Memory
Architecture and compatible APIs, it might be possible for two
different backends to access tensors of the same TensorHandle type. For
example, the CpuAcc (Neon) backend can work just as well using tensors
allocated by the GpuAcc (CL) backend. To support this in a generic way,
each backend will be able to report a list of known TensorHandleFactory
instances with which it is compatible, via the following method added
to the IBackendInternal interface:
virtual std::vector<ITensorHandleFactory::FactoryId> GetHandleFactoryPreferences() const = 0;
This method should return, in preference order, the FactoryIds of any
factories (including the backend's own) with which the backend is
compatible. The list is ordered from highest performance to highest
compatibility.
In the discrete GPU example, the GPU-only tensor factory would be first
on the list and the tensor factory which supports Map/Unmap would be
second.
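For that example, the implementation might look something like this
(the factory names are again hypothetical):

std::vector<ITensorHandleFactory::FactoryId> SampleGpuBackend::GetHandleFactoryPreferences() const
{
    return { SampleGpuTensorHandleFactory::GetIdStatic(),       // device-local: highest performance
             SampleStagingTensorHandleFactory::GetIdStatic() }; // mappable: highest compatibility
}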
TensorHandleFactory properties
There will be additional methods on this ITensorHandleFactory
interface to query the properties of the TensorHandles allocated by
the factory (exact API TBD). These properties will be queried by the
Optimizer when coming up with a tensor handle strategy for "optimal
performance".
Some example properties might be:
- SupportsSubTensors - equivalent to the existing functionality on the
IWorkloadFactory.
- SupportsMapUnmap - Map/Unmap support is currently required; however,
it will likely become optional in the future.
- SupportsMemoryImport - the memory copy of inputs could be removed in
scenarios where TensorHandles can import externally allocated memory.
- SupportsMemoryExport - the memory copy between two different backends
could be removed in scenarios where they support memory export and
memory import respectively.
The framework will use these properties to determine the best
allocation strategy (i.e. which factory to use, or when to insert
memory copies) and to identify unsupported/invalid scenarios (i.e. no
compatible factories found).
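Since the exact API is TBD, one possible shape - purely a sketch - is a
set of boolean capability queries on ITensorHandleFactory, with
defaults matching today's required behaviour:

// Sketch only: names and defaults are illustrative, not final.
virtual bool SupportsSubTensors() const { return false; }
virtual bool SupportsMapUnmap() const { return true; }  // required today, may become optional
virtual bool SupportsMemoryImport() const { return false; }
virtual bool SupportsMemoryExport() const { return false; }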
MemoryTypes
For memory import and export scenarios, this initial implementation
will be limited to CPU-addressable memory. In the future we can add
support for importing from DmaBuf or IonBuffer, and even protected
DmaBuf.
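To illustrate how the initial CPU-only scope could stay extensible, a
hypothetical import hook on ITensorHandle might take a source tag
alongside the pointer (none of this is final API):

// Hypothetical sketch: starts with CPU-addressable memory and leaves
// room for DmaBuf and protected DmaBuf sources later.
enum class MemorySource { Malloc, DmaBuf, DmaBufProtected };

// Returns false if the handle cannot import memory from this source.
virtual bool Import(void* memory, MemorySource source) { return false; }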
...
I hope you'll agree that this design opens up a lot of potential for
improved flexibility and performance. I look forward to further
discussion on this subject.
Kind regards,
Derek