Hi,
We have a packaging/linking/optimization problem at LNG, and I hope you can give us some advice on it. (Cc'ing the ODP list in case someone wants to add something.)

We have OpenDataPlane (ODP), an API sitting between userspace applications and hardware SDKs. It is defined as a set of C headers, and we already have several implementations that sit on top of the SDKs (or whatever actually controls the hardware), e.g. linux-generic, a DPDK one, etc. And we have applications, like Open vSwitch (OVS), which can now work with any ODP platform implementation that implements this API.

When it comes to packaging, the ideal scenario would be to create one package for the application, e.g. openvswitch.deb, and one for each platform, e.g. odp-generic.deb, odp-dpdk.deb. The latter would ship the implementation as a libodp.so file, so the application can dynamically load whichever platform library is actually installed at runtime, with all the benefits of dynamic linking.

The trouble is that we have several accessor functions in the API which are very short and __very__ frequently used. The best example is "uint32_t odp_packet_len(odp_packet_t pkt)", which returns the length of the packet. odp_packet_t is an opaque type defined by the implementation, often a pointer to the packet's metadata, so the call boils down to a simple load from that metadata pointer (plus an offset). Wrapping it in a function call causes a significant performance drop: when forwarding 64-byte packets at 10 Gbps, I got 13.2 Mpps with function calls; when I inlined that function, I got 13.8 Mpps, a ~5% difference. And there are a lot of other frequently used short accessor functions with the same problem. But obviously, if I inline these functions I break the ABI, and I need to compile the application for each platform (and create packages like openvswitch-odp-dpdk.deb, with the platform statically linked).

I've tried to look around on Google and in the gcc manual, but I couldn't find a good solution for this kind of problem. I've checked link time optimization (-flto), but it only helps with static linking. Is there any way to keep the ODP application and platform implementation binaries in separate files while having the performance benefit of inlining?
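To make the trade-off concrete, here is a minimal sketch of the two ways an accessor like odp_packet_len() can be exposed. The header layout, the frame_len field and the ODP_ABI_COMPAT guard are invented for illustration; they are not the actual internals of any implementation:

#include <stdint.h>

/* Hypothetical handle and header layout, for illustration only. */
typedef struct odp_packet_hdr odp_packet_hdr_t;
typedef odp_packet_hdr_t *odp_packet_t;

#ifdef ODP_ABI_COMPAT
/* ABI-stable variant: the handle stays opaque and the accessor is a real
 * function exported from the platform's libodp.so. Portable, but every
 * call pays the function-call overhead. */
uint32_t odp_packet_len(odp_packet_t pkt);
#else
/* Platform-tuned variant: the header layout leaks into the application,
 * so the accessor inlines to a single load. Fast, but the application
 * must be rebuilt and repackaged for every platform. */
struct odp_packet_hdr {
    /* ...other metadata... */
    uint32_t frame_len;    /* assumed field name */
};
static inline uint32_t odp_packet_len(odp_packet_t pkt)
{
    return pkt->frame_len;
}
#endif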
Regards,
Zoltan
On Fri, Nov 6, 2015 at 6:48 AM, Zoltan Kiss zoltan.kiss@linaro.org wrote:
I've checked link time optimization (-flto), but it only helps with static linking. Is there any way to keep the ODP application and platform implementation binaries in separate files while having the performance benefit of inlining?
I haven't been able to think of a good way to do this, and apparently no one else has either.
There is a not so good way to do it. You could distribute relocatable link (i.e. ld -r) output instead of executables and shared libraries, and then do the final LTO compile and link at run-time. This just creates a new set of problems though. There would be a long delay for the LTO compile and link before you start routing packets which would be inconvenient. It would be better to do the LTO compile and link just once and reuse the binary, but then you have the problem of where do you put the binary and how do you give it proper owner and group permissions. There may also be issues with using ld -r with LTO. You probably don't want this mess.
Otherwise, you need some kind of JIT or rewritable code scheme to redo compiler optimizations at run-time, and we don't have that technology, at least not with gcc. I don't know if LLVM has any useful feature here.
Jim
The IO Visor project (https://www.iovisor.org/) appears to be doing something like this with LLVM and JIT constructs to dynamically insert code into the kernel in a platform-independent manner. Perhaps we can leverage that technology?
Bill
On Mon, Nov 9, 2015 at 2:39 PM, Bill Fischofer bill.fischofer@linaro.org wrote:
The IO Visor project appears to be doing something like this with LLVM and JIT constructs to dynamically insert code into the kernel in a platform-independent manner. Perhaps we can leverage that technology?
GCC has some experimental JIT support, but I think it would be a lot of work to use it, and I don't know how stable it is. https://gcc.gnu.org/wiki/JIT The LLVM support is probably more advanced.
Jim
Adding Grant Likely to this chain as it relates to the broader subject of portable ABIs that we've been discussing.
A JIT approach (like Lua's) might also not work, because you would need to rewrite OVS to support it, and I don't think that would be accepted.
And it looks like a problem in OVS rather than in ODP: OVS should allow library functions to be used on the fast path (where inlines are critical), i.e. not just call odp_packet_len(), but move the whole OVS function into the dynamic library.
regards, Maxim.
On 10/11/15 07:39, Maxim Uvarov wrote:
A JIT approach (like Lua's) might also not work, because you would need to rewrite OVS to support it, and I don't think that would be accepted.
And it looks like a problem in OVS rather than in ODP: OVS should allow library functions to be used on the fast path (where inlines are critical), i.e. not just call odp_packet_len(), but move the whole OVS function into the dynamic library.
I'm not sure I get your point here, but OVS does allow using dynamic library functions on the fast path. The problem is that it's slow, because of the function call overhead.
On 10 November 2015 at 13:41, Zoltan Kiss zoltan.kiss@linaro.org wrote:
I'm not sure I get your point here, but OVS does allow using dynamic library functions on the fast path. The problem is that it's slow, because of the function call overhead.
I'm not familiar with the OVS code, but for example OVS has something like:
ovs_get_and_packet_process() {
    // here you use some inlines:
    pkt = odp_recv();
    len = odp_packet_len(pkt);
    ... etc.
}
So clearly each target arch needs its own variant of the ovs_get_and_packet_process() function, and that function should move from OVS into the dynamic library.
Maxim.
On Tue, Nov 10, 2015 at 11:08 AM, Maxim Uvarov maxim.uvarov@linaro.org wrote:
So clearly each target arch needs its own variant of the ovs_get_and_packet_process() function, and that function should move from OVS into the dynamic library.
Which library? A library specific to OVS, or some common ODP library that everyone uses? In either case the solution is not scalable. In the first case it still requires the app vendor to have a separate build for each and every supported target. In the second, it basically argues for all fast-path application-specific code to go into a non-app-specific library. That really won't fly.
I have two answers to this question. One for the short term, and one for the long.
In the short term we have no choice. If we're going to support portable application binaries, then we cannot do inlines; ODP simply isn't set up to support that. Portable binaries will have to take the hit of a function call each and every time. It's not fast, but it *works*, which at least sets a lowest common denominator. To mitigate the problem we could encourage application packages to include a generic build (no inlines, but works everywhere) plus one or more optimized builds (with inlines), with the correct binary selected at runtime. Not great, but it is a reasonable answer for the short term.
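As a rough illustration of that runtime selection, a trivial launcher could pick the per-platform build at startup. The paths and binary names below are invented; real packaging would define its own layout:

#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    (void)argc;
    /* Prefer the DPDK-optimized build if its platform library is installed. */
    if (access("/usr/lib/libodp-dpdk.so", F_OK) == 0)
        execv("/usr/lib/openvswitch/ovs-vswitchd-odp-dpdk", argv);
    /* Otherwise fall back to the portable, no-inline build. */
    execv("/usr/lib/openvswitch/ovs-vswitchd-generic", argv);
    perror("execv");
    return 1;
}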
For the long term, to get away from per-platform builds, I see two viable options. Bill suggested the first: use LLVM to optimize at runtime so that things like inlines get picked up when the application is linked against the platform library. There is some precedent of other projects already doing this, so it isn't as far-fetched as it may seem. The second is to do what we already do in the kernel for ftrace: instrument the function calls and patch them at runtime with optimized inlines. Not pretty, probably fragile, but we do have the knowledge from the kernel of how to do it. All said, I would prefer an LLVM-based solution, but investigation is needed to figure out how to make it work.
g.
On 10-11-2015 10:04, Grant Likely wrote:
For the long term, to get away from per-platform builds, I see two viable options. Bill suggested the first: use LLVM to optimize at runtime so that things like inlines get picked up when the application is linked against the platform library. There is some precedent of other projects already doing this, so it isn't as far-fetched as it may seem. The second is to do what we already do in the kernel for ftrace: instrument the function calls and patch them at runtime with optimized inlines. Not pretty, probably fragile, but we do have the knowledge from the kernel of how to do it. All said, I would prefer an LLVM-based solution, but investigation is needed to figure out how to make it work.
The LLVM JIT approach would require a lot of engineering work on the ODP side. Currently LLVM provides two JIT engines: MCJIT and ORC (which is new in LLVM 3.7).
MCJIT works on 'modules': the program can either pass a C or IR file, or use the API to create a module with multiple functions. The JIT engine then builds an ELF object that is loaded into the process address space; it is essentially an ahead-of-time JIT.
ORC stands for 'On Request Compilation'; it differs from MCJIT in that it aims at lazy compilation using indirection hooks, so a function is not JITted until it is called. [1]
In any case you won't get inline speed if you just JIT the inlinable calls; they will still be indirect calls to the JITted functions. Neither engine supports patchpoints, which is what the kernel does to dynamically patch the code with specific instructions.
If you actually want to change code dynamically you can try the DynamoRIO [2] project, which aims to provide an API for that. However, it is aimed at instrumentation, so I am not sure how well it plays with performance-sensitive projects.
Instead of focusing on dynamic code generation for such inlines, I would suggest working on more general functions that are actually called through either the PLT or some other indirection, and creating a dispatch that is selected at runtime.
You can follow the GCC strategy of indirect calls (the __builtin_cpu_supports() checks, which OpenSSL emulates as well) or, since this is a library, use IFUNC on the PLT calls (as glibc does for its memory and math routines). With current GCC you can build different versions of the same function and add an IFUNC dispatcher to select the best one at runtime.
[1] http://article.gmane.org/gmane.comp.compilers.llvm.devel/80639 [2] http://www.dynamorio.org/
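As a rough sketch of the IFUNC idea (the packet header layout, the variant names and the selection logic below are placeholders, and it needs a GNU toolchain and glibc with IFUNC support); note that this only removes the cost of picking an implementation, the accessor itself remains an out-of-line call:

/* Built as part of the platform's libodp.so, e.g.
 *   gcc -O2 -fPIC -shared odp_len_ifunc.c -o libodp.so */
#include <stdint.h>

struct odp_packet_hdr { uint32_t frame_len; };   /* assumed layout */
typedef struct odp_packet_hdr *odp_packet_t;

static uint32_t odp_packet_len_generic(odp_packet_t pkt)
{
    return pkt->frame_len;                       /* portable baseline */
}

static uint32_t odp_packet_len_tuned(odp_packet_t pkt)
{
    return pkt->frame_len;                       /* stand-in for a tuned variant */
}

/* The resolver runs once, when the dynamic linker binds the symbol,
 * and returns the implementation used for all later calls. */
static uint32_t (*resolve_odp_packet_len(void))(odp_packet_t)
{
    int have_fast_path = 1;   /* placeholder: check CPU/platform here,
                               * e.g. via getauxval(AT_HWCAP) */
    return have_fast_path ? odp_packet_len_tuned : odp_packet_len_generic;
}

uint32_t odp_packet_len(odp_packet_t pkt)
    __attribute__((ifunc("resolve_odp_packet_len")));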
On 10/11/15 12:04, Grant Likely wrote:
In the short term we have no choice. If we're going to support portable application binaries, then we cannot do inlines; ODP simply isn't set up to support that. Portable binaries will have to take the hit of a function call each and every time. It's not fast, but it *works*, which at least sets a lowest common denominator. To mitigate the problem we could encourage application packages to include a generic build (no inlines, but works everywhere) plus one or more optimized builds (with inlines), with the correct binary selected at runtime. Not great, but it is a reasonable answer for the short term.
For the short term I would argue for producing platform-specific packages as well, at least for ODP-OVS. As ODP-OVS is not upstream, we need to produce an openvswitch-odp package anyway (which would be set to conflict with the normal openvswitch package). My idea is to create openvswitch-odp-[platform] packages, though I don't know if you can set a wildcard conflict rule during packaging to make sure only one of them is installed at a time.
For the long term, to get away from per-platform builds, I see two viable options. Bill suggested the first: use LLVM to optimize at runtime so that things like inlines get picked up when the application is linked against the platform library. There is some precedent of other projects already doing this, so it isn't as far-fetched as it may seem.
But wouldn't it tie us down with LLVM?
The second is to do what we already do in the kernel for ftrace: instrument the function calls and patch them at runtime with optimized inlines. Not pretty, probably fragile, but we do have the knowledge from the kernel
Yes, I was also thinking about the ftrace approach, but I'm not familiar enough with ld.so to judge how hard it would be.
of how to do it. All said, I would prefer an LLVM based solution, but investigation is needed to figure out how to make it work.
On Tue, Nov 10, 2015 at 3:04 PM, Zoltan Kiss zoltan.kiss@linaro.org wrote:
But wouldn't it tie us down with LLVM?
Does that worry you? LLVM is a mature, open-source project with a lot of momentum behind it. There are worse things we could do than align with LLVM when it brings a capability we cannot get anywhere else.
g.
On 10/11/15 15:08, Grant Likely wrote:
Does that worry you?
Only that we would then require our applications to use LLVM if they want performance. I don't know the impact of that.
On Nov 10, 2015, at 7:28 AM, Zoltan Kiss zoltan.kiss@linaro.org wrote:
Only that we would then require our applications to use LLVM if they want performance. I don't know the impact of that.
Or they recompile the programs to get the speed. I am sorry, but this is not a new problem; most embedded folks are used to this. What an ODP vendor could do is provide optimized versions of the programs they think are important.
Thanks, Andrew
On 10/11/15 11:08, Maxim Uvarov wrote:
So clearly each target arch needs its own variant of the ovs_get_and_packet_process() function, and that function should move from OVS into the dynamic library.
I see. That would mitigate some of the problems, but unfortunately the use of these accessor functions can't be narrowed down to a particular piece of fast-path code. Packet length is a good example: you need it very often during processing, in different parts of the code.
On 6 November 2015 at 15:48, Zoltan Kiss zoltan.kiss@linaro.org wrote:
We also need binary compatibility between different ODP implementations. Binary compatibility that goes beyond an ABI.
I would be happy if, for a start, we could prove that we actually have source code compatibility: e.g. compile the exact same app against different ODP implementations and run it on the respective platforms with the expected behaviour (including performance).
On 11/06/2015 05:48 PM, Zoltan Kiss wrote:
Hi!
If all you need is a fast and portable binary, I wonder if you could use relocations to attain your goal. I mean, have the dynamic linker overwrite the call instructions at startup with some machine-specific absolute values, e.g. with 0xe590400c, which is the encoding of the 'ldr r0, [r0, #12]' instruction and which seems to be fully equivalent to a call to odp_packet_len.
A slightly different option is to inline the call at build time, but have the dynamic linker patch the inlined instructions at startup, for example writing the correct offsets into the load instructions.
Nikolai
On 13/11/15 16:19, Nikolai Bozhenov wrote:
If all you need is a fast and portable binary, I wonder if you could use relocations to attain your goal. I mean, have the dynamic linker overwrite the call instructions at startup with some machine-specific absolute values, e.g. with 0xe590400c, which is the encoding of the 'ldr r0, [r0, #12]' instruction and which seems to be fully equivalent to a call to odp_packet_len.
Something like that would be the best, yes, but it seems gcc and friends don't support that. As others said, LLVM probably has a JIT which could do it.
My information is admittedly a bit dated now, but the last time I looked into it LLVM still had some performance deficits versus gcc. Have they caught up? Otherwise any gains from their JIT inlining technology may be offset by diminished performance overall.
On 11/13/2015 07:35 PM, Zoltan Kiss wrote:
Something like that would be the best, yes, but it seems gcc and friends don't support that. As others said, LLVM probably has a JIT which could do it.
I don't think you need a JIT; a JIT is obviously overkill for that. All you need is to reserve some space at compile time (e.g. with inline assembly) and then patch that space at startup. The latter sounds like a task for the loader.
That said, I don't think any toolchain supports this today; it is not typical to have such hot, small functions in shared libraries. So you would have to do some toolchain development anyway to support the suggested optimization.
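A very rough sketch of the reservation side of that idea, in x86-64 GCC inline assembly. The section name and window size are made up, and the startup-time patcher (which would walk the .odp_patch section and mprotect() the text before rewriting it) is not shown:

/* Reserve an 8-byte NOP window at a call site and record its address in
 * a custom section, so a startup hook could later overwrite the window
 * with the inlined load. */
#define ODP_PATCH_SITE()                                      \
    __asm__ volatile(".pushsection .odp_patch, \"a\"\n\t"     \
                     ".quad 1f\n\t"                           \
                     ".popsection\n"                          \
                     "1:\n\t"                                 \
                     ".skip 8, 0x90"                          \
                     ::: "memory")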
Nikolai