My information is admittedly a bit dated now, but the last time I looked into it, LLVM still had some performance deficits versus GCC.
Have they caught up now?  Otherwise, any performance gains from their JIT inlining technology may be offset by diminished performance overall.

On Fri, Nov 13, 2015 at 10:35 AM, Zoltan Kiss <zoltan.kiss@linaro.org> wrote:


On 13/11/15 16:19, Nikolai Bozhenov wrote:


On 11/06/2015 05:48 PM, Zoltan Kiss wrote:
Hi,

We have a packaging/linking/optimization problem at LNG; I hope you
guys can give us some advice on that. (Cc'ing the ODP list in case someone
wants to add something.)
We have OpenDataPlane (ODP), an API sitting between userspace
applications and hardware SDKs. It's defined in the form of C headers,
and we already have several implementations targeting different SDKs (or whatever
is actually controlling the hardware), e.g. linux-generic, a DPDK one,
etc.
And we have applications, like Open vSwitch (OVS), which can now
work with any ODP platform implementation that implements this API.
When it comes to packaging, the ideal scenario would be to create one
package for the application, e.g. openvswitch.deb, and one for each
platform, e.g. odp-generic.deb, odp-dpdk.deb. The latter would contain
the implementation in the form of a libodp.so file, so the
application can dynamically load the installed platform's
library at runtime, with all the benefits of dynamic linking.
The trouble is that we have several accessor functions in the API
which are very short and __very__ frequently used. The best example is
"uint32_t odp_packet_len(odp_packet_t pkt)", which returns the length
of the packet. odp_packet_t is an opaque type defined by the
implementation, often a pointer to the packet's actual metadata, so
the function body reduces to a simple load from that metadata
pointer (+offset). Having it wrapped in a function call brings a
significant performance decrease: when forwarding 64-byte packets at
10 Gbps, I got 13.2 Mpps with function calls. Inlining that
function brought 13.8 Mpps, a ~5% difference. And there are a
lot of other frequently used short accessor functions with the same
problem.
But obviously if I inline these functions I break the ABI, and I need
to compile the application for each platform (and create packages like
openvswitch-odp-dpdk.deb, with the platform statically linked).
I've searched Google and the gcc manual, but I couldn't
find a good solution for this kind of problem.
I've checked link time optimization (-flto), but it only helps with
static linking. Is there any way to keep the ODP application and
platform implementation binaries in separate files while having the
performance benefit of inlining?

Regards,

Zoltan


Hi!

If all you need is a fast and portable binary, I wonder if you
could use relocations to attain your goal. I mean, have the dynamic
linker overwrite the call instructions at startup time with some
machine-specific instruction encodings, e.g. 0xe590400c, which is the
binary representation of the 'ldr r0, [r0, #12]' instruction and which
seems to be fully equivalent to the call to odp_packet_len.

Something like that would be the best, yes, but it seems gcc and friends don't support it. As others said, LLVM probably has a JIT which can do that.

Another slightly different option is to inline the call at build time,
but have the dynamic linker patch some of the inlined instructions at
startup time, for example writing the correct offsets into the load
instructions.

Nikolai

_______________________________________________
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp