On 06.04.23 15:22, Mathias Krause wrote:
> On 06.04.23 04:25, Sean Christopherson wrote:
>> On Sat, Mar 25, 2023, Greg KH wrote:
>>> On Sat, Mar 25, 2023 at 12:39:59PM +0100, Mathias Krause wrote:
>>>> As this is a huge performance fix for us, we'd like to get it integrated into current stable kernels as well -- not without having the changes get some wider testing first, of course, i.e. not before they end up in a non-rc version released by Linus. But I already did a backport to 5.4 to get a feeling for how hard it would be and for the impact it has on older kernels.
>>>> Using the 'ssdd 10 50000' test I used before, I get promising results there as well. Without the patches it takes 9.31s, while with them we're down to 4.64s. Considering that this cuts the runtime of a workload inside a VM in half, I hope it qualifies as stable material, as it's a huge performance fix.
>>>> Greg, what's your opinion on it? Original series here: https://lore.kernel.org/kvm/20230322013731.102955-1-minipli@grsecurity.net/
>>> I'll leave the judgement call up to the KVM maintainers, as they are the ones that need to ack any KVM patch added to stable trees.
>> These are quite risky to backport. E.g. we botched patch 6[*], and my initial fix also had a subtle bug. There have also been quite a few KVM MMU changes since 5.4, so it's possible that an edge case may exist in 5.4 that doesn't exist in mainline.
> I totally agree. Getting the changes to work with older kernels needs more work. The MMU role handling was refactored in 5.14, and going back to 5.4 it differs even more, so backports to earlier kernels definitely need more care.
> My plan would be to limit backporting of the whole series to kernels down to 5.15 (maybe 5.10 if it turns out to be doable) and, for kernels before that, to backport the series without patch 6. That would leave out the problematic change but still give us the benefit of dropping the needless MMU unloads when the guest only toggles CR0.WP. This already helps us a lot!
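For reference, the core of that optimization is roughly the following. This is a simplified sketch based on the series linked above, using current mainline identifiers (kvm_post_set_cr0(), tdp_enabled, kvm_init_mmu()), not the literal upstream diff:

/*
 * Sketch: a CR0.WP-only toggle with TDP enabled only needs the MMU
 * metadata refreshed, not a full root unload.
 */
void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0,
		      unsigned long cr0)
{
	/*
	 * CR0.WP only influences the page protections KVM has to encode
	 * for shadow paging; with TDP the SPTEs are unaffected, so refresh
	 * the MMU's metadata and skip the costly root unload.
	 */
	if (tdp_enabled && (cr0 ^ old_cr0) == X86_CR0_WP) {
		kvm_init_mmu(vcpu);
		return;
	}

	/*
	 * ... existing handling stays as is, including the
	 * kvm_mmu_reset_context() for other role-changing CR0 bits ...
	 */
}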
To back up the "helps us a lot" with some numbers, here are the results from running the 'ssdd 10 50000' micro-benchmark on the backports I did, in a grsecurity L1 VM (host is a vanilla kernel, as stated below; runtime in seconds, lower is better):
                  legacy     TDP   shadow
Linux v5.4.240         -   8.87s    56.8s
    + patches          -   5.84s    55.4s
Linux v5.10.177   10.37s   88.7s    69.7s
    + patches      4.88s   4.92s    70.1s
Linux v5.15.106    9.94s   66.1s    64.9s
    + patches      4.81s   4.79s    64.6s
Linux v6.1.23      7.65s   8.23s    68.7s
    + patches      3.36s   3.36s    69.1s
Linux v6.2.10      7.61s   7.98s    68.6s
    + patches      3.37s   3.41s    70.2s
I guess we can largely ignore the shadow MMU numbers, besides noting that they regress from v5.4 to v5.10 (something to investigate?). The backports don't help (much) for shadow MMU setups, and the flux in the measurements is likely related to the slab allocations involved.
Another unrelated data point is that the TDP MMU is really broken for our use case on v5.10 and v5.15 -- it's even slower than shadow paging!
OTOH, the backports give nice speed-ups, ranging from ~2.2 times faster for pure EPT (legacy) MMU setups up to 18(!!!) times faster for TDP MMU on v5.10.
I backported the whole series down to v5.10, but left out the CR0.WP guest-owning patch (and its fix) for v5.4, as the code base is too different to get all the nuances right, as Sean already hinted. However, even this limited backport provides a big performance fix for our use case!
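For completeness, the left-out patch 6 makes CR0.WP a guest-owned bit under VMX, so toggling it no longer causes a VM-exit at all. A rough sketch of the idea (again based on the series and mainline identifiers such as KVM_POSSIBLE_CR0_GUEST_BITS and enable_ept, not the exact diff):

/*
 * Sketch: let L1 own CR0.WP when EPT is in use, so CR0.WP toggles are
 * handled entirely inside the guest.  With shadow paging CR0.WP must stay
 * intercepted, as it affects the protections KVM encodes into shadow PTEs.
 */
#define KVM_POSSIBLE_CR0_GUEST_BITS	(X86_CR0_TS | X86_CR0_WP)

static inline unsigned long vmx_l1_guest_owned_cr0_bits(void)
{
	unsigned long bits = KVM_POSSIBLE_CR0_GUEST_BITS;

	/* CR0.WP can only be passed through when EPT (TDP) is enabled. */
	if (!enable_ept)
		bits &= ~X86_CR0_WP;

	return bits;
}

Readers of CR0 inside KVM then have to fetch guest-owned bits from the VMCS instead of the cached value, which is what makes the backport to older code bases tricky.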
Thanks,
Mathias