Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>
---
 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we'd have to clean the whole of the guest memory, which makes a full dcache clean more appealing).
M.
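Marc's userspace suggestion could be sketched as follows. This is a hypothetical helper, not part of kvmtool; "DC DVAC" in the thread presumably refers to the data-cache clean by VA to the point of coherency, DC CVAC. The cache-line stride is read from CTR_EL0, and on non-AArch64 builds the cache instructions compile out so only the walk remains:

```c
#include <stddef.h>
#include <stdint.h>

/* Clean a guest-memory region from the d-cache to the point of coherency,
 * so a guest running with its Stage1 MMU off reads what userspace wrote.
 * Returns the number of cache lines walked. */
static size_t clean_dcache_region(void *addr, size_t size)
{
    size_t linesz = 64;  /* fallback line size on non-AArch64 builds */
    size_t cleaned = 0;
    uintptr_t p, end = (uintptr_t)addr + size;

#ifdef __aarch64__
    uint64_t ctr;
    /* CTR_EL0.DminLine gives log2 of the smallest d-cache line, in words. */
    __asm__ volatile("mrs %0, ctr_el0" : "=r"(ctr));
    linesz = 4UL << ((ctr >> 16) & 0xf);
#endif
    for (p = (uintptr_t)addr & ~(uintptr_t)(linesz - 1); p < end; p += linesz) {
#ifdef __aarch64__
        __asm__ volatile("dc cvac, %0" :: "r"(p) : "memory");
#endif
        cleaned++;
    }
#ifdef __aarch64__
    __asm__ volatile("dsb sy" ::: "memory");
#endif
    return cleaned;
}
```

Userspace execution of DC CVAC relies on the kernel enabling EL0 cache maintenance (SCTLR_EL1.UCI), which Linux does.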
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we'd have to clean the whole of the guest memory, which makes a full dcache clean more appealing).

Actually, cleaning the full d-cache by set/way upon the first run of a VCPU was our second option, but the current approach seemed very simple, hence we went with it.

If more people vote for a full d-cache clean upon the first run of a VCPU, then we should revise this patch.
M.
--
Fast, cheap, reliable. Pick two.

_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
--Anup
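The "flush once, on the first VCPU run" option Anup describes boils down to a per-VM guard flag. A minimal illustrative sketch (the structure and function names are hypothetical stand-ins, not the actual KVM data structures):

```c
#include <stdbool.h>

/* Illustrative stand-ins for the real KVM structures and for the
 * expensive whole-cache clean (e.g. a d-cache set/way walk). */
struct vm {
    bool ran_once;      /* set the first time any vcpu runs */
    int  full_flushes;  /* counts how often the big penalty was paid */
};

static void flush_entire_dcache(struct vm *vm)
{
    vm->full_flushes++;  /* stand-in for the set/way clean loop */
}

/* Called on every KVM_RUN: the whole-cache clean happens exactly once,
 * which is the "one-time penalty" argued for in this thread. */
static void vcpu_first_run_init(struct vm *vm)
{
    if (!vm->ran_once) {
        flush_entire_dcache(vm);
        vm->ran_once = true;
    }
}
```

The trade-off debated below is exactly this: one expensive clean here, versus a cheap-but-repeated `__flush_dcache_area()` on every page mapped at stage-2.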
On 14.08.2013, at 16:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Not sure I understand at which point you really need to make things coherent here. When you assign a new ASID because there could be stale cache entries left from an old ASID run? Can't you just flush all caches when you overrun your ASID allocator?
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we'd have to clean the whole of the guest memory, which makes a full dcache clean more appealing).

Actually, cleaning the full d-cache by set/way upon the first run of a VCPU was our second option, but the current approach seemed very simple, hence we went with it.

If more people vote for a full d-cache clean upon the first run of a VCPU, then we should revise this patch.
Is there any difference in performance between the 2 approaches?
Alex
On 2013-08-14 16:06, Alexander Graf wrote:
On 14.08.2013, at 16:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Not sure I understand at which point you really need to make things coherent here. When you assign a new ASID because there could be stale cache entries left from an old ASID run? Can't you just flush all caches when you overrun your ASID allocator?
First, a bit of terminology:
- ASIDs are used for userspace. KVM doesn't deal with those.
- VMIDs are used for VMs. That's what KVM deals with.
The issue here is that when the MMU is disabled, the access goes straight to RAM, bypassing the caches altogether (OK not completely true, but for the sake of the argument, that's close enough).
When userspace loads the kernel into memory, the kernel is not flushed to RAM, and may sit in the L3 cache if the cache is big enough. You end-up executing garbage... My proposed fix is to let kvmtool do the flushing, as we have userspace cache management operations for this exact purpose.
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we'd have to clean the whole of the guest memory, which makes a full dcache clean more appealing).

Actually, cleaning the full d-cache by set/way upon the first run of a VCPU was our second option, but the current approach seemed very simple, hence we went with it.

If more people vote for a full d-cache clean upon the first run of a VCPU, then we should revise this patch.
Is there any difference in performance between the 2 approaches?
dcache clean on each and every page being mapped in stage-2, versus a one-time penalty. Should be significant indeed.
But again, I'd like to see figures. What is the impact of each of the 3 methods for various memory and cache sizes?
M.
On 14 August 2013 16:34, Marc Zyngier maz@misterjones.org wrote:
When userspace loads the kernel into memory, the kernel is not flushed to RAM, and may sit in the L3 cache if the cache is big enough. You end-up executing garbage... My proposed fix is to let kvmtool do the flushing, as we have userspace cache management operations for this exact purpose.
Why does this issue only apply to the loaded kernel and not to the zero bytes in the rest of RAM? I know executing zeroes isn't a very useful thing to do but it should be a well defined thing.
-- PMM
On 2013-08-14 16:41, Peter Maydell wrote:
On 14 August 2013 16:34, Marc Zyngier maz@misterjones.org wrote:
When userspace loads the kernel into memory, the kernel is not flushed to RAM, and may sit in the L3 cache if the cache is big enough. You end-up executing garbage... My proposed fix is to let kvmtool do the flushing, as we have userspace cache management operations for this exact purpose.
Why does this issue only apply to the loaded kernel and not to the zero bytes in the rest of RAM? I know executing zeroes isn't a very useful thing to do but it should be a well defined thing.
Good point, and not quite sure just yet. Probably we get a zeroed, clean page?
Anup, can you elaborate on how your L3 cache behaves?
Thanks,
M.
On Wed, Aug 14, 2013 at 9:27 PM, Marc Zyngier maz@misterjones.org wrote:
On 2013-08-14 16:41, Peter Maydell wrote:
On 14 August 2013 16:34, Marc Zyngier maz@misterjones.org wrote:
When userspace loads the kernel into memory, the kernel is not flushed to RAM, and may sit in the L3 cache if the cache is big enough. You end-up executing garbage... My proposed fix is to let kvmtool do the flushing, as we have userspace cache management operations for this exact purpose.
Why does this issue only apply to the loaded kernel and not to the zero bytes in the rest of RAM? I know executing zeroes isn't a very useful thing to do but it should be a well defined thing.
Good point, and not quite sure just yet. Probably we get a zeroed, clean page?
This would apply to zeroed pages too if some Guest OS expects zeroed-out portions in Guest RAM.
Anup, can you elaborate on how your L3 cache behaves?
The L3-cache is transparent to AArch64 CPUs. It is only bypassed when caching is disabled or when accessing non-cacheable pages. Currently, we cannot disclose more details about line allocation and replacement policy because the APM X-Gene specs are not released.

The issue pointed out by this patch would also apply to the L2-cache (i.e. with the L3-cache disabled), but it usually does not manifest because the L2-cache is not too big (a few hundred KBs), so dirty lines don't stay in the L2-cache for a long time.
Thanks,
M.
--
Who you jivin' with that Cosmik Debris?
Regards, Anup
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Well, we have it on AArch64. Why would we blindly nuke the whole cache if we can do the right thing, efficiently, on the right range?
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we'd have to clean the whole of the guest memory, which makes a full dcache clean more appealing).

Actually, cleaning the full d-cache by set/way upon the first run of a VCPU was our second option, but the current approach seemed very simple, hence we went with it.

If more people vote for a full d-cache clean upon the first run of a VCPU, then we should revise this patch.
That sounds a lot better than randomly flushing the dcache for no good reason. Most of the time, your MMU will be ON, and only the initial text/data loaded from userspace is at risk.
But the userspace version sounds definitely better, and I'd like you to evaluate it before going the kernel way.
Thanks,
M.
On 14 August 2013 16:23, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-14 15:22, Anup Patel wrote:
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Well, we have it on AArch64. Why would we blindly nuke the whole cache if we can do the right thing, efficiently, on the right range?
When exactly would userspace have to care about the cache? This patch isn't exactly clear about the circumstances. I think you'd need a really strong reason for not dealing with this in the kernel -- in general userspace KVM tools don't otherwise have to deal with cache maintenance at all.
-- PMM
On 2013-08-14 16:35, Peter Maydell wrote:
On 14 August 2013 16:23, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-14 15:22, Anup Patel wrote:
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Well, we have it on AArch64. Why would we blindly nuke the whole cache if we can do the right thing, efficiently, on the right range?
When exactly would userspace have to care about the cache?
Only for the initial payload, I'd expect. Unless I've missed something more crucial?
This patch isn't exactly clear about the circumstances. I think you'd need a really strong reason for not dealing with this in the kernel -- in general userspace KVM tools don't otherwise have to deal with cache maintenance at all.
I believe we *could* do it in the kernel, just at the expense of a lot more CPU cycles.
A possible alternative would be to use HCR.DC, but I need to have a look and see what it breaks...
M.
On Wed, Aug 14, 2013 at 04:49:24PM +0100, Marc Zyngier wrote:
On 2013-08-14 16:35, Peter Maydell wrote:
On 14 August 2013 16:23, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-14 15:22, Anup Patel wrote:
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Well, we have it on AArch64. Why would we blindly nuke the whole cache if we can do the right thing, efficiently, on the right range?
When exactly would userspace have to care about the cache?
Only for the initial payload, I'd expect. Unless I've missed something more crucial?
What happens if the page is swapped out? Is the kernel guaranteed to flush it all the way to RAM when it swaps it back in, or would KVM have to take care of that?
This patch isn't exactly clear about the circumstances. I think you'd need a really strong reason for not dealing with this in the kernel -- in general userspace KVM tools don't otherwise have to deal with cache maintenance at all.
I believe we *could* do it in the kernel, just at the expense of a lot more CPU cycles.
A possible alternative would be to use HCR.DC, but I need to have a look and see what it breaks...
Could we flush the cache when we fault in pages only if the guest has the MMU disabled, and trap when the guest disables the MMU, flushing the whole outer dcache at that point in time?
-Christoffer
On 2013-08-14 18:34, Christoffer Dall wrote:
On Wed, Aug 14, 2013 at 04:49:24PM +0100, Marc Zyngier wrote:
On 2013-08-14 16:35, Peter Maydell wrote:
On 14 August 2013 16:23, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-14 15:22, Anup Patel wrote:
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Well, we have it on AArch64. Why would we blindly nuke the whole cache if we can do the right thing, efficiently, on the right range?
When exactly would userspace have to care about the cache?
Only for the initial payload, I'd expect. Unless I've missed something more crucial?
What happens if the page is swapped out? Is the kernel guaranteed to flush it all the way to RAM when it swaps it back in, or would KVM have to take care of that?
I'd expect the kernel to deal with it, otherwise you'd hit the wrong data each time you swap in a page in any userspace process.
This patch isn't exactly clear about the circumstances. I think you'd need a really strong reason for not dealing with this in the kernel -- in general userspace KVM tools don't otherwise have to deal with cache maintenance at all.
I believe we *could* do it in the kernel, just at the expense of a lot more CPU cycles.
A possible alternative would be to use HCR.DC, but I need to have a look and see what it breaks...
Could we flush the cache when we fault in pages only if the guest has the MMU disabled, and trap when the guest disables the MMU, flushing the whole outer dcache at that point in time?
We don't need to trap the disabling of the MMU. If the guest does that, it *must* have flushed its cache to RAM already. Otherwise, it is utterly broken, virtualization or not.
What the guest doesn't expect is the initial data to sit in the cache while it hasn't set the MMU on just yet. I may have a patch for that.
M.
On Thu, Aug 15, 2013 at 05:44:52AM +0100, Marc Zyngier wrote:
On 2013-08-14 18:34, Christoffer Dall wrote:
On Wed, Aug 14, 2013 at 04:49:24PM +0100, Marc Zyngier wrote:
On 2013-08-14 16:35, Peter Maydell wrote:
On 14 August 2013 16:23, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-14 15:22, Anup Patel wrote:
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Well, we have it on AArch64. Why would we blindly nuke the whole cache if we can do the right thing, efficiently, on the right range?
When exactly would userspace have to care about the cache?
Only for the initial payload, I'd expect. Unless I've missed something more crucial?
What happens if the page is swapped out? Is the kernel guaranteed to flush it all the way to RAM when it swaps it back in, or would KVM have to take care of that?
I'd expect the kernel to deal with it, otherwise you'd hit the wrong data each time you swap in a page in any userspace process.
Unless the kernel only does that for pages that are mapped executable, which it cannot know for VMs.
Admittedly I haven't looked at the code recently. Do you know?
This patch isn't exactly clear about the circumstances. I think you'd need a really strong reason for not dealing with this in the kernel -- in general userspace KVM tools don't otherwise have to deal with cache maintenance at all.
I believe we *could* do it in the kernel, just at the expense of a lot more CPU cycles.
A possible alternative would be to use HCR.DC, but I need to have a look and see what it breaks...
Could we flush the cache when we fault in pages only if the guest has the MMU disabled, and trap when the guest disables the MMU, flushing the whole outer dcache at that point in time?
We don't need to trap the disabling of the MMU. If the guest does that, it *must* have flushed its cache to RAM already. Otherwise, it is utterly broken, virtualization or not.
What the guest doesn't expect is the initial data to sit in the cache while it hasn't set the MMU on just yet. I may have a patch for that.
Otherwise we can at least do this only when faulting in pages while the MMU is off, correct?
Is it completely inconceivable that user space starts the guest and when some device is poked, it loads something else into memory that is going to be executed after the VM has been started, but before turning on the MMU? If not, then doing a single flush before starting the VM won't cut it.
-Christoffer
On Wed, Aug 14, 2013 at 8:53 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Well, we have it on AArch64. Why would we blindly nuke the whole cache if we can do the right thing, efficiently, on the right range?
Flushing from user-space would be better only if the target virtual address range is small. If user-space loads several images at different locations, or large images, in guest RAM, then flushing the entire d-cache by set/way would be more efficient.
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we'd have to clean the whole of the guest memory, which makes a full dcache clean more appealing).

Actually, cleaning the full d-cache by set/way upon the first run of a VCPU was our second option, but the current approach seemed very simple, hence we went with it.

If more people vote for a full d-cache clean upon the first run of a VCPU, then we should revise this patch.
That sounds a lot better than randomly flushing the dcache for no good reason. Most of the time, your MMU will be ON, and only the initial text/data loaded from userspace is at risk.
But the userspace version sounds definitely better, and I'd like you to evaluate it before going the kernel way.
Considering the penalty, I like the approach of flushing the d-cache by set/way upon the first run of a VCPU, because it's just a one-time penalty.
Thanks,
M.
-- Fast, cheap, reliable. Pick two.
--Anup
Hi Anup,
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we'd have to clean the whole of the guest memory, which makes a full dcache clean more appealing).

Actually, cleaning the full d-cache by set/way upon the first run of a VCPU was our second option, but the current approach seemed very simple, hence we went with it.

If more people vote for a full d-cache clean upon the first run of a VCPU, then we should revise this patch.
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
Thanks,
M.
Hi Marc,
On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Anup,
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with a large external L3-cache (a few MBs) might have dirty content belonging to guest pages in the L3-cache. To tackle this, we need to flush such dirty content from the d-cache so that the guest sees the correct contents of its pages when the guest MMU is disabled.

This patch fixes coherent_icache_guest_page() for systems with an external L3-cache.

Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
Signed-off-by: Anup Patel <anup.patel@linaro.org>

 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
The alternative is do it in the kernel before running any vcpu - but that's not very nice either (have to clean the whole of the guest memory, which makes a full dcache clean more appealing).
Actually, cleaning the full d-cache by set/way upon the first run of a VCPU was our second option, but the current approach seemed very simple, hence we went with it.
If more people vote for full d-cache clean upon first run of VCPU then we should revise this patch.
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
HCR_EL2.DC=1 means all memory accesses with the Stage-1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory".
HCR_EL2.DC=0 means all memory accesses with the Stage-1 MMU off are treated as "Strongly-ordered device memory".
Now if a Guest/VM accesses hardware MMIO devices directly (such as the VGIC CPU interface) with the MMU off, then those MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
HCR_EL2.DC=1 makes sense only if we have all-software-emulated devices for the Guest/VM, which is not true for KVM ARM or KVM ARM64 because we use the VGIC.
IMHO, this patch enforces an incorrect memory attribute for the Guest/VM when the Stage-1 MMU is off.
Nevertheless, we will still try your patch.
Thanks,
M.
-- Fast, cheap, reliable. Pick two.
Regards, Anup
On 2013-08-15 07:26, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Anup,
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with large external L3-cache (few MBs), might have dirty content belonging to the guest page in L3-cache. To tackle this, we need to flush such dirty content from d-cache so that guest will see correct contents of guest page when guest MMU is disabled.
The patch fixes coherent_icache_guest_page() for external L3-cache.
Signed-off-by: Pranavkumar Sawargaonkar pranavkumar@linaro.org Signed-off-by: Anup Patel anup.patel@linaro.org
arch/arm64/include/asm/kvm_mmu.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h index efe609c..5129038 100644 --- a/arch/arm64/include/asm/kvm_mmu.h +++ b/arch/arm64/include/asm/kvm_mmu.h @@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) if (!icache_is_aliasing()) { /* PIPT */ unsigned long hva = gfn_to_hva(kvm, gfn); flush_icache_range(hva, hva + PAGE_SIZE);
/* Flush d-cache for systems with external caches. */
__flush_dcache_area((void *) hva, PAGE_SIZE); } else if (!icache_is_aivivt()) { /* non ASID-tagged VIVT */ /* any kind of VIPT cache */ __flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning is user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
The alternative is do it in the kernel before running any vcpu - but that's not very nice either (have to clean the whole of the guest memory, which makes a full dcache clean more appealing).
Actually, cleaning full d-cache by set/way upon first run of VCPU was our second option but current approach seemed very simple hence we went for this.
If more people vote for full d-cache clean upon first run of VCPU then we should revise this patch.
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
HCR_EL2.DC=1 means all memory access for Stage1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory"
HCR_EL2.DC=0 means all memory access for Stage1 MMU off are treated as "Strongly-ordered device memory"
Now if Guest/VM access hardware MMIO devices directly (such as VGIC CPU interface) with MMU off then MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
The HCR_EL2.DC=1 makes sense only if we have all software emulated devices for Guest/VM which is not true for KVM ARM or KVM ARM64 because we use VGIC.
IMHO, this patch enforces incorrect memory attribute for Guest/VM when Stage1 MMU is off.
See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
Nevertheless, we will still try your patch.
Thanks.
M.
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 07:26, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Anup,
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with large external L3-cache (few MBs), might have dirty content belonging to the guest page in L3-cache. To tackle this, we need to flush such dirty content from d-cache so that guest will see correct contents of guest page when guest MMU is disabled.
The patch fixes coherent_icache_guest_page() for external L3-cache.
Signed-off-by: Pranavkumar Sawargaonkar pranavkumar@linaro.org Signed-off-by: Anup Patel anup.patel@linaro.org
arch/arm64/include/asm/kvm_mmu.h | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h index efe609c..5129038 100644 --- a/arch/arm64/include/asm/kvm_mmu.h +++ b/arch/arm64/include/asm/kvm_mmu.h @@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) if (!icache_is_aliasing()) { /* PIPT */ unsigned long hva = gfn_to_hva(kvm, gfn); flush_icache_range(hva, hva + PAGE_SIZE);
/* Flush d-cache for systems with external caches. */
__flush_dcache_area((void *) hva, PAGE_SIZE); } else if (!icache_is_aivivt()) { /* non ASID-tagged VIVT */ /* any kind of VIPT cache */ __flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning is user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
The alternative is do it in the kernel before running any vcpu - but that's not very nice either (have to clean the whole of the guest memory, which makes a full dcache clean more appealing).
Actually, cleaning full d-cache by set/way upon first run of VCPU was our second option but current approach seemed very simple hence we went for this.
If more people vote for full d-cache clean upon first run of VCPU then we should revise this patch.
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
HCR_EL2.DC=1 means all memory access for Stage1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory"
HCR_EL2.DC=0 means all memory access for Stage1 MMU off are treated as "Strongly-ordered device memory"
Now if Guest/VM access hardware MMIO devices directly (such as VGIC CPU interface) with MMU off then MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
The HCR_EL2.DC=1 makes sense only if we have all software emulated devices for Guest/VM which is not true for KVM ARM or KVM ARM64 because we use VGIC.
IMHO, this patch enforces incorrect memory attribute for Guest/VM when Stage1 MMU is off.
See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
You had mentioned that PAGE_S2_DEVICE attribute was redundant and wanted guest to decide the memory attribute. In other words, you did not want to enforce any memory attribute in Stage2.
Please refer to this patch https://patchwork.kernel.org/patch/2543201/
Nevertheless, we will still try your patch.
Thanks.
M.
Regards, Anup
On 2013-08-15 14:31, Anup Patel wrote:
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 07:26, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Anup,
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning is user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
The alternative is do it in the kernel before running any vcpu - but that's not very nice either (have to clean the whole of the guest memory, which makes a full dcache clean more appealing).
Actually, cleaning full d-cache by set/way upon first run of VCPU was our second option but current approach seemed very simple hence we went for this.
If more people vote for full d-cache clean upon first run of VCPU then we should revise this patch.
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
HCR_EL2.DC=1 means all memory access for Stage1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory"
HCR_EL2.DC=0 means all memory access for Stage1 MMU off are treated as "Strongly-ordered device memory"
Now if Guest/VM access hardware MMIO devices directly (such as VGIC CPU interface) with MMU off then MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
The HCR_EL2.DC=1 makes sense only if we have all software emulated devices for Guest/VM which is not true for KVM ARM or KVM ARM64 because we use VGIC.
IMHO, this patch enforces incorrect memory attribute for Guest/VM when Stage1 MMU is off.
See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
You had mentioned that PAGE_S2_DEVICE attribute was redundant and wanted guest to decide the memory attribute. In other words, you did not want to enforce any memory attribute in Stage2.
Please refer to this patch https://patchwork.kernel.org/patch/2543201/
This patch has never been merged. If you carry on following the discussion, you will certainly notice it was dropped for a very good reason: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
So Stage-2 memory attributes are used, they are not going away, and they are essential to the patch I sent this morning.
M.
Hi Marc,
On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 14:31, Anup Patel wrote:
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 07:26, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Anup,
On 2013-08-14 15:22, Anup Patel wrote:
It seems a bit unnatural to have cache cleaning is user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
Actually, cleaning full d-cache by set/way upon first run of VCPU was our second option but current approach seemed very simple hence we went for this.
If more people vote for full d-cache clean upon first run of VCPU then we should revise this patch.
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
HCR_EL2.DC=1 means all memory access for Stage1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory"
HCR_EL2.DC=0 means all memory access for Stage1 MMU off are treated as "Strongly-ordered device memory"
Now if Guest/VM access hardware MMIO devices directly (such as VGIC CPU interface) with MMU off then MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
The HCR_EL2.DC=1 makes sense only if we have all software emulated devices for Guest/VM which is not true for KVM ARM or KVM ARM64 because we use VGIC.
IMHO, this patch enforces incorrect memory attribute for Guest/VM when Stage1 MMU is off.
See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
You had mentioned that PAGE_S2_DEVICE attribute was redundant and wanted guest to decide the memory attribute. In other words, you did not want to enforce any memory attribute in Stage2.
Please refer to this patch https://patchwork.kernel.org/patch/2543201/
This patch has never been merged. If you carry on following the discussion, you will certainly notice it was dropped for a very good reason: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
So Stage-2 memory attributes are used, they are not going away, and they are essential to the patch I sent this morning.
M.
HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM is given a DMA-capable device in pass-through mode. The reason is that bootloaders/firmware typically don't enable the MMU, and such a bootloader/firmware will program a pass-through DMA-capable device without any flushes to guest RAM (because it has not enabled the MMU).
A good example of such a device would be a SATA AHCI controller given to a Guest/VM as direct access (using a System MMU), with the Guest bootloader/firmware accessing this SATA AHCI controller to load kernel images from a SATA disk. In this situation, we would have to hack the Guest bootloader/firmware AHCI driver to do explicit flushes to Guest RAM (because HCR_EL2.DC=1).
Regards, Anup
On 2013-08-15 16:13, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 14:31, Anup Patel wrote:
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 07:26, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Anup,
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
HCR_EL2.DC=1 means all memory access for Stage1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory"
HCR_EL2.DC=0 means all memory access for Stage1 MMU off are treated as "Strongly-ordered device memory"
Now if Guest/VM access hardware MMIO devices directly (such as VGIC CPU interface) with MMU off then MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
The HCR_EL2.DC=1 makes sense only if we have all software emulated devices for Guest/VM which is not true for KVM ARM or KVM ARM64 because we use VGIC.
IMHO, this patch enforces incorrect memory attribute for Guest/VM when Stage1 MMU is off.
See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
You had mentioned that PAGE_S2_DEVICE attribute was redundant and wanted guest to decide the memory attribute. In other words, you did not want to enforce any memory attribute in Stage2.
Please refer to this patch https://patchwork.kernel.org/patch/2543201/
This patch has never been merged. If you carry on following the discussion, you will certainly notice it was dropped for a very good reason: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
So Stage-2 memory attributes are used, they are not going away, and they are essential to the patch I sent this morning.
M.
HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM is provided a DMA-capable device in pass-through mode. The reason being bootloader/firmware typically don't enable MMU and such bootloader/firmware will programme a pass-through DMA-capable device without any flushes to guest RAM (because it has not enabled MMU).
A good example of such a device would be SATA AHCI controller given to a Guest/VM as direct access (using SystemMMU) and Guest bootloader/firmware accessing this SATA AHCI controller to load kernel images from SATA disk. In this situation, we will have to hack Guest bootloader/firmware AHCI driver to explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of curiosity: is that a made-up example or something you actually have?
Back to square one: Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?
Thanks,
M.
On Thu, Aug 15, 2013 at 9:07 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 16:13, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 14:31, Anup Patel wrote:
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 07:26, Anup Patel wrote:
Hi Marc,
HCR_EL2.DC=1 means all memory access for Stage1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory"
HCR_EL2.DC=0 means all memory access for Stage1 MMU off are treated as "Strongly-ordered device memory"
Now if Guest/VM access hardware MMIO devices directly (such as VGIC CPU interface) with MMU off then MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
The HCR_EL2.DC=1 makes sense only if we have all software emulated devices for Guest/VM which is not true for KVM ARM or KVM ARM64 because we use VGIC.
IMHO, this patch enforces incorrect memory attribute for Guest/VM when Stage1 MMU is off.
See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
You had mentioned that PAGE_S2_DEVICE attribute was redundant and wanted guest to decide the memory attribute. In other words, you did not want to enforce any memory attribute in Stage2.
Please refer to this patch https://patchwork.kernel.org/patch/2543201/
This patch has never been merged. If you carry on following the discussion, you will certainly notice it was dropped for a very good reason: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
So Stage-2 memory attributes are used, they are not going away, and they are essential to the patch I sent this morning.
M.
HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM is provided a DMA-capable device in pass-through mode. The reason being bootloader/firmware typically don't enable MMU and such bootloader/firmware will programme a pass-through DMA-capable device without any flushes to guest RAM (because it has not enabled MMU).
A good example of such a device would be SATA AHCI controller given to a Guest/VM as direct access (using SystemMMU) and Guest bootloader/firmware accessing this SATA AHCI controller to load kernel images from SATA disk. In this situation, we will have to hack Guest bootloader/firmware AHCI driver to explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of curiosity: is that a made up example or something you actually have?
Actually, this would be a typical use-case in embedded virtualization. Other devices that fit this use-case would be network controllers, security engines, or proprietary HW without drivers in Linux.
Back to square one: Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?
Thanks,
M.
--Anup
On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
On 2013-08-15 16:13, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 14:31, Anup Patel wrote:
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 07:26, Anup Patel wrote:
Hi Marc,
HCR_EL2.DC=1 means all memory accesses with the Stage 1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory".

HCR_EL2.DC=0 means all memory accesses with the Stage 1 MMU off are treated as "Strongly-ordered device memory".

Now, if the Guest/VM accesses hardware MMIO devices directly (such as the VGIC CPU interface) with the MMU off, then those MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
HCR_EL2.DC=1 makes sense only if all devices for the Guest/VM are software-emulated, which is not true for KVM ARM or KVM ARM64 because we use the VGIC.

IMHO, this patch enforces an incorrect memory attribute for the Guest/VM when the Stage 1 MMU is off.
See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
You had mentioned that PAGE_S2_DEVICE attribute was redundant and wanted guest to decide the memory attribute. In other words, you did not want to enforce any memory attribute in Stage2.
Please refer to this patch https://patchwork.kernel.org/patch/2543201/
This patch has never been merged. If you carry on following the discussion, you will certainly notice it was dropped for a very good reason: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
So Stage-2 memory attributes are used, they are not going away, and they are essential to the patch I sent this morning.
M.
-- Fast, cheap, reliable. Pick two.
HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM is provided a DMA-capable device in pass-through mode. The reason is that bootloaders/firmware typically don't enable the MMU, and so will program a pass-through DMA-capable device without any cache flushes to guest RAM.

A good example of such a device would be a SATA AHCI controller given to a Guest/VM as direct access (using a System MMU), with the Guest bootloader/firmware accessing this SATA AHCI controller to load kernel images from a SATA disk. In this situation, we would have to hack the Guest bootloader/firmware AHCI driver to explicitly flush to Guest RAM (because HCR_EL2.DC=1).
OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of curiosity: is that a made up example or something you actually have?
Back to square one: Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?
Eh, why is this a more valid argument than the vgic? The device passthrough Stage-2 mappings would still have the Stage-2 memory attributes to configure that memory region as device memory. Why is it relevant if the device is DMA-capable in this context?
-Christoffer
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
Eh, why is this a more valid argument than the vgic? The device passthrough Stage-2 mappings would still have the Stage-2 memory attributes to configure that memory region as device memory. Why is it relevant if the device is DMA-capable in this context?
-Christoffer
Most ARM bootloader/firmware run with MMU off hence, they will not do explicit cache flushes when programming DMA-capable device. Now If HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware will not be visible to DMA-capable device.
--Anup
On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel anup@brainfault.org wrote:
The approach of flushing the d-cache by set/way upon first run of the VCPU will not work, because for set/way operations the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations only work up to the point of unification.

Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at boot time, which does not work for us when the L3-cache is enabled, so there is no point in adding one more __flush_dcache_all() upon first run of the VCPU in KVM ARM64.

IMHO, we are left with the following options:
1. Flush all RAM regions of the VCPU using __flush_dcache_range() upon first run of the VCPU.
2. Implement an outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of the VCPU.
3. Use an alternate version of flush_icache_range() which flushes the d-cache by PoC instead of PoU. We can also ensure that coherent_icache_guest_page() is called upon Stage 2 prefetch aborts only.
Regards, Anup
On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
The approach of flushing the d-cache by set/way upon first run of the VCPU will not work, because for set/way operations the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations only work up to the point of unification.

Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at boot time, which does not work for us when the L3-cache is enabled, so there is no point in adding one more __flush_dcache_all() upon first run of the VCPU in KVM ARM64.
When did anyone suggest using a cache cleaning method that doesn't apply to this case? I'm a little confused about your comment here.
IMHO, we are left with following options:
- Flush all RAM regions of VCPU using __flush_dcache_range() upon first run of VCPU
We could do that, but we have to ensure that no other memory regions can be added later. Is this the case? I don't remember.
- Implement outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of VCPU
What's the difference between (2) and (1)?
- Use an alternate version of flush_icache_range() which will flush d-cache by PoC instead of PoU. We can also ensure that coherent_icache_guest_page() function will be called upon Stage2 prefetch aborts only.
I'm sorry, but I'm not following this discussion. Are we not mixing a discussion along two different axis here? As I see it there are two separate issues:
(A) When/where to flush the memory regions.
(B) How to flush the memory regions, including the hardware method and the kernel software engineering aspect.
-Christoffer
On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel anup@brainfault.org wrote:
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
On 2013-08-15 16:13, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier marc.zyngier@arm.com wrote: > On 2013-08-15 14:31, Anup Patel wrote: >> >> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier >> marc.zyngier@arm.com >> wrote: >>> >>> On 2013-08-15 07:26, Anup Patel wrote: >>>> >>>> >>>> Hi Marc, >>>> >>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier >>>> marc.zyngier@arm.com >>>> wrote: >>>>> >>>>> >>>>> Hi Anup, >>>>> >>>>> >>>>> On 2013-08-14 15:22, Anup Patel wrote: >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier >>>>>> marc.zyngier@arm.com >>>>>> wrote: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi Pranav, >>>>>>> >>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> Systems with large external L3-cache (few MBs), might have >>>>>>>> dirty >>>>>>>> content belonging to the guest page in L3-cache. To tackle >>>>>>>> this, >>>>>>>> we need to flush such dirty content from d-cache so that >>>>>>>> guest >>>>>>>> will see correct contents of guest page when guest MMU is >>>>>>>> disabled. >>>>>>>> >>>>>>>> The patch fixes coherent_icache_guest_page() for external >>>>>>>> L3-cache. >>>>>>>> >>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar >>>>>>>> pranavkumar@linaro.org >>>>>>>> Signed-off-by: Anup Patel anup.patel@linaro.org >>>>>>>> --- >>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ >>>>>>>> 1 file changed, 2 insertions(+) >>>>>>>> >>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h >>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h >>>>>>>> index efe609c..5129038 100644 >>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h >>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >>>>>>>> @@ -123,6 +123,8 @@ static inline void >>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) >>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ >>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); >>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); >>>>>>>> + /* Flush d-cache for systems with external >>>>>>>> caches. 
*/ >>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); >>>>>>>> } else if (!icache_is_aivivt()) { /* non >>>>>>>> ASID-tagged >>>>>>>> VIVT >>>>>>>> */ >>>>>>>> /* any kind of VIPT cache */ >>>>>>>> __flush_icache_all(); >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> [adding Will to the discussion as we talked about this in the >>>>>>> past] >>>>>>> >>>>>>> That's definitely an issue, but I'm not sure the fix is to hit >>>>>>> the >>>>>>> data >>>>>>> cache on each page mapping. It looks overkill. >>>>>>> >>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? >>>>>>> kvmtools >>>>>>> knows which bits of the guest memory have been touched, and >>>>>>> can do a >>>>>>> "DC >>>>>>> DVAC" on this region. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> It seems a bit unnatural to have cache cleaning is user-space. >>>>>> I am >>>>>> sure >>>>>> other architectures don't have such cache cleaning in >>>>>> user-space for >>>>>> KVM. >>>>>> >>>>>>> >>>>>>> The alternative is do it in the kernel before running any vcpu >>>>>>> - but >>>>>>> that's not very nice either (have to clean the whole of the >>>>>>> guest >>>>>>> memory, which makes a full dcache clean more appealing). >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Actually, cleaning full d-cache by set/way upon first run of >>>>>> VCPU was >>>>>> our second option but current approach seemed very simple hence >>>>>> we went for this. >>>>>> >>>>>> If more people vote for full d-cache clean upon first run of >>>>>> VCPU then >>>>>> we should revise this patch. >>>>> >>>>> >>>>> >>>>> >>>>> Can you please give the attached patch a spin on your HW? I've >>>>> boot-tested >>>>> it on a model, but of course I can't really verify that it fixes >>>>> your >>>>> issue. >>>>> >>>>> As far as I can see, it should fix it without any additional >>>>> flushing. >>>>> >>>>> Please let me know how it goes. 
>>>> >>>> >>>> >>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are >>>> treated as "Normal Non-shareable, Inner Write-Back >>>> Write-Allocate, >>>> Outer Write-Back Write-Allocate memory" >>>> >>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are >>>> treated as "Strongly-ordered device memory" >>>> >>>> Now if Guest/VM access hardware MMIO devices directly (such as >>>> VGIC CPU interface) with MMU off then MMIO devices will be >>>> accessed as normal memory when HCR_EL2.DC=1. >>> >>> >>> >>> I don't think so. Stage-2 still applies, and should force MMIO to >>> be >>> accessed as device memory. >>> >>> >>>> The HCR_EL2.DC=1 makes sense only if we have all software >>>> emulated devices for Guest/VM which is not true for KVM ARM or >>>> KVM ARM64 because we use VGIC. >>>> >>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM >>>> when Stage1 MMU is off. >>> >>> >>> >>> See above. My understanding is that HCR.DC controls the default >>> output of >>> Stage-1, and Stage-2 overrides still apply. >> >> >> You had mentioned that PAGE_S2_DEVICE attribute was redundant >> and wanted guest to decide the memory attribute. In other words, >> you >> did not want to enforce any memory attribute in Stage2. >> >> Please refer to this patch >> https://patchwork.kernel.org/patch/2543201/ > > > This patch has never been merged. If you carry on following the > discussion, > you will certainly notice it was dropped for a very good reason: > https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > > So Stage-2 memory attributes are used, they are not going away, and > they are > essential to the patch I sent this morning. > > > M. > -- > Fast, cheap, reliable. Pick two.
HCR_EL2.DC=1 will break the Guest/VM bootloader/firmware if the Guest/VM is provided a DMA-capable device in pass-through mode. The reason is that bootloaders/firmware typically don't enable the MMU, and such a bootloader/firmware will programme a pass-through DMA-capable device without any flushes to guest RAM (because it has not enabled the MMU).

A good example of such a device would be a SATA AHCI controller given to a Guest/VM as direct access (using a System MMU), with the Guest bootloader/firmware accessing this SATA AHCI controller to load kernel images from a SATA disk. In this situation, we would have to hack the Guest bootloader/firmware AHCI driver to do explicit flushes to Guest RAM (because we have HCR_EL2.DC=1).
OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of curiosity: is that a made up example or something you actually have?
Back to square one: Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?
Eh, why is this a more valid argument than the vgic? The device passthrough Stage-2 mappings would still have the Stage-2 memory attributes to configure that memory region as device memory. Why is it relevant if the device is DMA-capable in this context?
-Christoffer
Most ARM bootloaders/firmware run with the MMU off, hence they will not do explicit cache flushes when programming a DMA-capable device. Now if HCR_EL2.DC=1, then Guest RAM updates done by the bootloader/firmware will not be visible to the DMA-capable device.
--Anup
The approach of flushing the d-cache by set/way upon first run of the VCPU will not work, because for set/way operations the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations only work up to the point of unification.
Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at boot time, which does not work for us when the L3-cache is enabled, so there is no point in adding one more __flush_dcache_all() upon first run of the VCPU in KVM ARM64.
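For reference, the by-VA alternative being argued for - the kind of loop behind arm64's __flush_dcache_area(), where DC CVAC is architecturally defined to clean to the Point of Coherency and therefore past an external L3 - looks roughly like this. This is only a sketch, not the kernel's actual implementation; the fixed 64-byte line size is an assumption (real code derives it from CTR_EL0), and the asm is compiled only on aarch64:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of a by-VA clean to the Point of Coherency. Unlike set/way
 * maintenance, which only reaches "the next level of caching", DC CVAC
 * by VA must push dirty data past system caches such as an external L3.
 * CACHE_LINE is an assumed line size; real code reads it from CTR_EL0. */
#define CACHE_LINE 64UL

static void clean_dcache_to_poc(void *addr, size_t size)
{
	uintptr_t p = (uintptr_t)addr & ~(CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)addr + size;

	for (; p < end; p += CACHE_LINE) {
#ifdef __aarch64__
		asm volatile("dc cvac, %0" : : "r" (p) : "memory");
#else
		(void)p;	/* no-op stand-in when built off-target */
#endif
	}
#ifdef __aarch64__
	asm volatile("dsb sy" : : : "memory"); /* complete before MMU-off/DMA access */
#endif
}
```

The clean must be followed by a barrier (the DSB above) before the guest or a DMA master can rely on the data being in RAM.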
When did anyone suggest using a cache cleaning method that doesn't apply to this case. I'm a little confused about your comment here.
Please refer last reply from Marc Z where he says: "Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?".
Rather than doing __flush_dcache_range() on each page in coherent_icache_guest_page(), we could also flush the entire d-cache upon first VCPU run. The only issue with flushing the d-cache upon first VCPU run is that we cannot flush the d-cache by set/way, because as per the ARM ARM all operations by set/way are up to PoU and not PoC. In the presence of an L3-cache we need to flush the d-cache up to PoC, so we have to use __flush_dcache_range().
IMHO, we are left with the following options:
1. Flush all RAM regions of VCPU using __flush_dcache_range() upon first run of VCPU
We could do that, but we have to ensure that no other memory regions can be added later. Is this the case? I don't remember.
Yes, memory regions being added later would be a problem for this option.
2. Implement outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of VCPU
What's the difference between (2) and (1)?
Linux ARM (i.e. the 32-bit port) has a framework for outer caches, i.e. caches that are not part of the CPU but exist as a separate entity between the CPU and memory for performance improvement. Using this framework we can flush all CPU caches and the outer cache.
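For illustration, the 32-bit framework being referred to is an ops table along these lines - a host-runnable sketch loosely modeled on 32-bit ARM's struct outer_cache_fns and outer_flush_range(); the "l3_" driver names are hypothetical, and a real driver would issue MMIO maintenance operations to the cache controller instead of counting bytes:

```c
#include <stddef.h>

/* Ops table that a platform fills in, letting generic code flush
 * caches beyond the CPU (here, a hypothetical external L3). */
struct outer_cache_fns {
	void (*clean_range)(unsigned long start, unsigned long end);
	void (*flush_range)(unsigned long start, unsigned long end);
	void (*flush_all)(void);
};

static struct outer_cache_fns outer_cache;	/* all-NULL => no outer cache */

static void outer_flush_range(unsigned long start, unsigned long end)
{
	if (outer_cache.flush_range)
		outer_cache.flush_range(start, end);
}

static unsigned long l3_flushed_bytes;		/* instrumentation for the sketch */

static void l3_flush_range(unsigned long start, unsigned long end)
{
	l3_flushed_bytes += end - start;	/* stand-in for real MMIO maintenance */
}

static void l3_cache_init(void)
{
	outer_cache.flush_range = l3_flush_range;
}
```

Generic code can then call outer_flush_range() unconditionally: with no controller registered it is a no-op, and once the platform init hook runs, the same call reaches the external cache.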
3. Use an alternate version of flush_icache_range() which will flush d-cache by PoC instead of PoU. We can also ensure that coherent_icache_guest_page() function will be called upon Stage2 prefetch aborts only.
I'm sorry, but I'm not following this discussion. Are we not mixing a discussion along two different axis here? As I see it there are two separate issues:
(A) When/where to flush the memory regions
(B) How to flush the memory regions, including the hardware method and the kernel software engineering aspect.
The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. one that is not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the contents of guest RAM will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
-Christoffer
--Anup
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel anup@brainfault.org wrote:
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
Linux ARM (i.e. the 32-bit port) has a framework for outer caches, i.e. caches that are not part of the CPU but exist as a separate entity between the CPU and memory for performance improvement. Using this framework we can flush all CPU caches and the outer cache.
And we don't have such a framework in arm64? But __flush_dcache_range does nevertheless flush outer caches as well?
The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. one that is not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the contents of guest RAM will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + outer cache framework for arm64
B.3: Use flush_icache_range
Or do these all interleave somehow? If so, I don't understand why. Can you help?
-Christoffer
On Fri, Aug 16, 2013 at 10:50:34AM -0700, Christoffer Dall wrote:
Or do these all interleave somehow? If so, I don't understand why. Can you help?
Oh, I think I understand your point now. You mean if we use flush_cache_all before we run the vcpu, then it's not sufficient? I assumed we would simply be using __flush_dcache_area for the guest RAM regions which would flush to PoC.
For the record, I think we need a solution that also covers the case where a memory region is registered later, and I therefore prefer having this functionality in the stage-2 fault handler, where we are already taking care of similar issues (like your patch suggested).
Especially if we can limit ourselves to doing so when the guest MMU is disabled, I really think this is going to be the least overhead, and measuring the performance of this penalty vs. flushing at first VCPU execution is a bit overkill IMHO, since we are only talking about boot time for any reasonable guest (which would presumably run with the MMU enabled for real workloads).
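A minimal sketch of the policy suggested here - on a Stage-2 fault, clean the faulted page to PoC only while the guest's Stage-1 MMU and data cache are still off. All names are illustrative assumptions; real code would read the guest's SCTLR_EL1 M and C bits rather than these flags, and the clean would be a by-VA loop rather than a counter:

```c
#include <stdbool.h>

/* Illustrative stand-ins for the guest's SCTLR_EL1 M (MMU) and C
 * (data cache) enable bits. */
struct vcpu {
	bool sctlr_m;	/* guest Stage-1 MMU enable */
	bool sctlr_c;	/* guest data-cache enable */
};

static int pages_cleaned;	/* instrumentation for the sketch */

static void clean_page_to_poc(unsigned long hva)
{
	(void)hva;		/* stand-in for a by-VA clean of the page */
	pages_cleaned++;
}

static void stage2_map_page(struct vcpu *v, unsigned long hva)
{
	/* Once the guest runs with MMU and caches on, it sees the same
	 * cache hierarchy as the host and no extra cleaning is needed. */
	if (!(v->sctlr_m && v->sctlr_c))
		clean_page_to_poc(hva);
	/* ... install the Stage-2 mapping for hva ... */
}
```

The attraction of this shape is that the overhead is confined to faults taken while the guest boots uncached; once the guest enables its MMU, the check short-circuits and faults proceed without maintenance.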
The only caveat is the previously discussed issue if user space loads code after the first VCPU execution, and only user space would know if that happens, which would argue for user space doing the cleaning... Hmmm.
I also still have my worries about swapping, since user space is free to map guest RAM as non-executable.
Did I miss anything here?
-Christoffer
On Fri, Aug 16, 2013 at 11:36 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
Out of >> curiosity: is that a made up example or something you actually have? >> >> Back to square one: >> Can you please benchmark the various cache cleaning options (global at >> startup time, per-page on S2 translation fault, and user-space)? >> > Eh, why is this a more valid argument than the vgic? The device > passthrough Stage-2 mappings would still have the Stage-2 memory > attributes to configure that memory region as device memory. Why is it > relevant if the device is DMA-capable in this context? > > -Christoffer
Most ARM bootloaders/firmware run with the MMU off; hence they will not do explicit cache flushes when programming a DMA-capable device. Now, if HCR_EL2.DC=1, guest RAM updates done by the bootloader/firmware will not be visible to the DMA-capable device.
--Anup
The approach of flushing the d-cache by set/way upon first run of a VCPU will not work because, for set/way operations, the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work only up to the point of unification.
Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at boot time, which does not work for us when the L3-cache is enabled, so there is no point in adding one more __flush_dcache_all() upon first run of a VCPU in KVM ARM64.
When did anyone suggest using a cache cleaning method that doesn't apply to this case? I'm a little confused by your comment here.
Please refer to the last reply from Marc Z, where he says: "Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?".
Rather than doing __flush_dcache_range() on each page in coherent_icache_guest_page(), we could also flush the entire d-cache upon first VCPU run. The only issue with flushing the d-cache upon first VCPU run is that we cannot flush the d-cache by set/way because, as per the ARM ARM, all operations by set/way are up to PoU and not PoC. In the presence of an L3-cache we need to flush the d-cache up to PoC, so we have to use __flush_dcache_range().
IMHO, we are left with the following options:
- Flush all RAM regions of VCPU using __flush_dcache_range() upon first run of VCPU
We could do that, but we have to ensure that no other memory regions can be added later. Is this the case? I don't remember.
Yes, memory regions being added later would be a problem for this option.
- Implement outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of VCPU
What's the difference between (2) and (1)?
Linux ARM (i.e. the 32-bit port) has a framework for outer caches (i.e. caches that are not part of the CPU but exist as a separate entity between the CPU and memory for performance improvement). Using this framework we can flush all CPU caches plus the outer cache.
And we don't have such a framework in arm64? But __flush_dcache_range does nevertheless flush outer caches as well?
- Use an alternate version of flush_icache_range() which will flush the d-cache by PoC instead of PoU. We can also ensure that the coherent_icache_guest_page() function is called upon Stage2 prefetch aborts only.
I'm sorry, but I'm not following this discussion. Are we not mixing a discussion along two different axis here? As I see it there are two separate issues:
(A) When/where to flush the memory regions
(B) How to flush the memory regions, including the hardware method and the kernel software engineering aspect.
The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. one that is not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the contents of guest RAM will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it bypasses the L3-cache and sees incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + an outer-cache framework for arm64
B.3: Use flush_icache_range
Or do these all interleave somehow? If so, I don't understand why. Can you help?
Oh, I think I understand your point now. You mean that if we use flush_cache_all before we run the VCPU, then it's not sufficient? I assumed we would simply be using __flush_dcache_area for the guest RAM regions, which would flush to PoC.
For the record, I think we need a solution that also covers the case where a memory region is registered later, and I therefore prefer having this functionality in the stage-2 fault handler, where we are already taking care of similar issues (like your patch suggested).
Especially if we can limit ourselves to doing so when the guest MMU is disabled, then I really think this is going to be the least overhead. Measuring the performance of this penalty vs. flushing at first VCPU execution is a bit overkill IMHO, since we are only talking about boot time for any reasonable guest (which would presumably run with the MMU enabled for real workloads).
Yes, currently we call coherent_icache_guest_page() upon all Stage2 translation faults, but we can improve a bit here, along the lines of your suggestion. Maybe we can call coherent_icache_guest_page() only when the VCPU MMU is off and we get a Stage2 instruction prefetch translation fault.
The only caveat is the previously discussed issue if user space loads code after the first VCPU execution, and only user space would know if that happens, which would argue for user space doing the cleaning... Hmmm.
I also still have my worries about swapping, since user space is free to map guest RAM as non-executable.
Did I miss anything here?
I don't know much about Linux swapping, but for the rest we are in sync.
-Christoffer
--Anup
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel anup@brainfault.org wrote:
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
[...]
> > Back to square one: > Can you please benchmark the various cache cleaning options (global at > startup time, per-page on S2 translation fault, and user-space)? > Eh, why is this a more valid argument than the vgic? The device passthrough Stage-2 mappings would still have the Stage-2 memory attributes to configure that memory region as device memory. Why is it relevant if the device is DMA-capable in this context?
-Christoffer
[...]
Linux ARM (i.e. the 32-bit port) has a framework for outer caches (i.e. caches that are not part of the CPU but exist as a separate entity between the CPU and memory for performance improvement). Using this framework we can flush all CPU caches plus the outer cache.
And we don't have such a framework in arm64? But __flush_dcache_range
Yes, for now we don't have outer-cache framework in arm64.
does nevertheless flush outer caches as well?
Yes, __flush_dcache_range() flushes the outer cache too.
__flush_dcache_range() works for us because it flushes (i.e. cleans & invalidates) by VA up to PoC. Any operation up to PoC on a cache line forces eviction of that line from all CPU caches (i.e. both L1 & L2) and also from the external L3-cache, ensuring that RAM has the updated contents of the cache line.
Now, outer caches (such as an L3-cache) typically provide a mechanism to flush the entire cache using the cache controller's registers. If we had an outer-cache framework, we could flush all caches using __flush_dcache_all() + an outer-cache flush-all.
- Use an alternate version of flush_icache_range() which will flush d-cache by PoC instead of PoU. We can also ensure that coherent_icache_guest_page() function will be called upon Stage2 prefetch aborts only.
I'm sorry, but I'm not following this discussion. Are we not mixing a discussion along two different axis here? As I see it there are two separate issues:
(A) When/where to flush the memory regions
(B) How to flush the memory regions, including the hardware method and the kernel software engineering aspect.
The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. one that is not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the contents of guest RAM will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it bypasses the L3-cache and sees incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
Yes, that's the decision we have to make.
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + an outer-cache framework for arm64
This would be __flush_dcache_all() + outer cache flush all.
B.3: Use flush_icache_range
Actually, a modified version of flush_icache_range() which uses "dc cvac" and not "dc cvau".
Or do these all interleave somehow? If so, I don't understand why. Can you help?
-Christoffer
--Anup
On Fri, Aug 16, 2013 at 11:41:51PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote:
On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel anup@brainfault.org wrote:
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall christoffer.dall@linaro.org wrote: > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: >> On 2013-08-15 16:13, Anup Patel wrote: >> > Hi Marc, >> > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier marc.zyngier@arm.com >> > wrote: >> >> On 2013-08-15 14:31, Anup Patel wrote: >> >>> >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier >> >>> marc.zyngier@arm.com >> >>> wrote: >> >>>> >> >>>> On 2013-08-15 07:26, Anup Patel wrote: >> >>>>> >> >>>>> >> >>>>> Hi Marc, >> >>>>> >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier >> >>>>> marc.zyngier@arm.com >> >>>>> wrote: >> >>>>>> >> >>>>>> >> >>>>>> Hi Anup, >> >>>>>> >> >>>>>> >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier >> >>>>>>> marc.zyngier@arm.com >> >>>>>>> wrote: >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> Hi Pranav, >> >>>>>>>> >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have >> >>>>>>>>> dirty >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle >> >>>>>>>>> this, >> >>>>>>>>> we need to flush such dirty content from d-cache so that >> >>>>>>>>> guest >> >>>>>>>>> will see correct contents of guest page when guest MMU is >> >>>>>>>>> disabled. >> >>>>>>>>> >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external >> >>>>>>>>> L3-cache. 
>> >>>>>>>>> >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar >> >>>>>>>>> pranavkumar@linaro.org >> >>>>>>>>> Signed-off-by: Anup Patel anup.patel@linaro.org >> >>>>>>>>> --- >> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ >> >>>>>>>>> 1 file changed, 2 insertions(+) >> >>>>>>>>> >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> index efe609c..5129038 100644 >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) >> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ >> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); >> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); >> >>>>>>>>> + /* Flush d-cache for systems with external >> >>>>>>>>> caches. */ >> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); >> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non >> >>>>>>>>> ASID-tagged >> >>>>>>>>> VIVT >> >>>>>>>>> */ >> >>>>>>>>> /* any kind of VIPT cache */ >> >>>>>>>>> __flush_icache_all(); >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> [adding Will to the discussion as we talked about this in the >> >>>>>>>> past] >> >>>>>>>> >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit >> >>>>>>>> the >> >>>>>>>> data >> >>>>>>>> cache on each page mapping. It looks overkill. >> >>>>>>>> >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? >> >>>>>>>> kvmtools >> >>>>>>>> knows which bits of the guest memory have been touched, and >> >>>>>>>> can do a >> >>>>>>>> "DC >> >>>>>>>> DVAC" on this region. >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. >> >>>>>>> I am >> >>>>>>> sure >> >>>>>>> other architectures don't have such cache cleaning in >> >>>>>>> user-space for >> >>>>>>> KVM. 
>> >>>>>>> >> >>>>>>>> >> >>>>>>>> The alternative is do it in the kernel before running any vcpu >> >>>>>>>> - but >> >>>>>>>> that's not very nice either (have to clean the whole of the >> >>>>>>>> guest >> >>>>>>>> memory, which makes a full dcache clean more appealing). >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of >> >>>>>>> VCPU was >> >>>>>>> our second option but current approach seemed very simple hence >> >>>>>>> we went for this. >> >>>>>>> >> >>>>>>> If more people vote for full d-cache clean upon first run of >> >>>>>>> VCPU then >> >>>>>>> we should revise this patch. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Can you please give the attached patch a spin on your HW? I've >> >>>>>> boot-tested >> >>>>>> it on a model, but of course I can't really verify that it fixes >> >>>>>> your >> >>>>>> issue. >> >>>>>> >> >>>>>> As far as I can see, it should fix it without any additional >> >>>>>> flushing. >> >>>>>> >> >>>>>> Please let me know how it goes. >> >>>>> >> >>>>> >> >>>>> >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are >> >>>>> treated as "Normal Non-shareable, Inner Write-Back >> >>>>> Write-Allocate, >> >>>>> Outer Write-Back Write-Allocate memory" >> >>>>> >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are >> >>>>> treated as "Strongly-ordered device memory" >> >>>>> >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be >> >>>>> accessed as normal memory when HCR_EL2.DC=1. >> >>>> >> >>>> >> >>>> >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to >> >>>> be >> >>>> accessed as device memory. >> >>>> >> >>>> >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or >> >>>>> KVM ARM64 because we use VGIC. 
>> >>>>> >> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM >> >>>>> when Stage1 MMU is off. >> >>>> >> >>>> >> >>>> >> >>>> See above. My understanding is that HCR.DC controls the default >> >>>> output of >> >>>> Stage-1, and Stage-2 overrides still apply. >> >>> >> >>> >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant >> >>> and wanted guest to decide the memory attribute. In other words, >> >>> you >> >>> did not want to enforce any memory attribute in Stage2. >> >>> >> >>> Please refer to this patch >> >>> https://patchwork.kernel.org/patch/2543201/ >> >> >> >> >> >> This patch has never been merged. If you carry on following the >> >> discussion, >> >> you will certainly notice it was dropped for a very good reason: >> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html >> >> >> >> So Stage-2 memory attributes are used, they are not going away, and >> >> they are >> >> essential to the patch I sent this morning. >> >> >> >> >> >> M. >> >> -- >> >> Fast, cheap, reliable. Pick two. >> > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM >> > is >> > provided a DMA-capable device in pass-through mode. The reason being >> > bootloader/firmware typically don't enable MMU and such >> > bootloader/firmware >> > will programme a pass-through DMA-capable device without any flushes >> > to >> > guest RAM (because it has not enabled MMU). >> > >> > A good example of such a device would be SATA AHCI controller given >> > to a >> > Guest/VM as direct access (using SystemMMU) and Guest >> > bootloader/firmware >> > accessing this SATA AHCI controller to load kernel images from SATA >> > disk. >> > In this situation, we will have to hack Guest bootloader/firmware >> > AHCI driver to >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). >> >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. 
Out of >> curiosity: is that a made up example or something you actually have? >> >> Back to square one: >> Can you please benchmark the various cache cleaning options (global at >> startup time, per-page on S2 translation fault, and user-space)? >> > Eh, why is this a more valid argument than the vgic? The device > passthrough Stage-2 mappings would still have the Stage-2 memory > attributes to configure that memory region as device memory. Why is it > relevant if the device is DMA-capable in this context? > > -Christoffer
Most ARM bootloader/firmware run with MMU off hence, they will not do explicit cache flushes when programming DMA-capable device. Now If HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware will not be visible to DMA-capable device.
--Anup
The approach of flushing d-cache by set/way upon first run of VCPU will not work because for set/way operations ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work upto point of unification.
Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at bootup time which does not work for us when L3-cache is enabled so, there is no point is adding one more __flush_dcache_all() upon first run of VCPU in KVM ARM64.
When did anyone suggest using a cache cleaning method that doesn't apply to this case. I'm a little confused about your comment here.
Please refer last reply from Marc Z where he says: "Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?".
Rather than doing __flush_dcache_range() on each page in coherent_icache_guest_page(), we could also flush the entire d-cache upon first VCPU run. The only issue with flushing the d-cache upon first VCPU run is that we cannot flush the d-cache by set/way because, as per the ARM ARM, all operations by set/way are up to the PoU and not the PoC. In the presence of an L3-cache we need to flush the d-cache up to the PoC, so we have to use __flush_dcache_range().
IMHO, we are left with following options:
- Flush all RAM regions of VCPU using __flush_dcache_range() upon first run of VCPU
We could do that, but we have to ensure that no other memory regions can be added later. Is this the case? I don't remember.
Yes, memory regions being added later would be a problem for this option.
- Implement outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of VCPU
What's the difference between (2) and (1)?
Linux ARM (i.e. 32-bit port) has a framework for having outer caches (i.e. caches that are not part of the CPU but exist as separate entity between CPU and Memory for performance improvement). Using this framework we can flush all CPU caches and outer cache.
And we don't have such a framework in arm64?

Yes, for now we don't have an outer-cache framework in arm64.

But __flush_dcache_range does nevertheless flush outer caches as well?

Yes, __flush_dcache_range() flushes the outer cache too.
The __flush_dcache_range() works for us because it flushes (i.e. cleans & invalidates) by VA up to the PoC. Any operation up to the PoC on a cache line will force eviction of the cache line from all CPU caches (i.e. both L1 & L2) and also from the external L3-cache, to ensure that RAM has the updated contents of the cache line.
Now, the outer caches (such as an L3-cache) typically provide a mechanism to flush the entire L3-cache using L3-cache registers. If we have an outer-cache framework then we can flush all caches using __flush_dcache_all() + outer cache flush all.
- Use an alternate version of flush_icache_range() which will flush the d-cache to the PoC instead of the PoU. We can also ensure that the coherent_icache_guest_page() function will be called upon Stage2 prefetch aborts only.
I'm sorry, but I'm not following this discussion. Are we not mixing a discussion along two different axes here? As I see it there are two separate issues:
(A) When/Where to flush the memory regions
(B) How to flush the memory regions, including the hardware method and the kernel software engineering aspect.
Discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images to guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the guest RAM contents will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
Yes, that's the decision we have to make.
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + outer cache framework for arm64
This would be __flush_dcache_all() + outer cache flush all.
B.3: Use flush_icache_range
Actually, a modified version of flush_icache_range() which uses "dc cvac" and not "dc cvau".
Or do these all interleave somehow? If so, I don't understand why. Can you help?
Hi Anup,
Thanks for the explanation, it's crystal now.
-Christoffer
On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
Discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images to guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the guest RAM contents will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
Yes, that's the decision we have to make.
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + outer cache framework for arm64
This would be __flush_dcache_all() + outer cache flush all.
We need to be careful here since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this).
So I think we are left to the range operation where the DC ops to PoC would be enough for your L3.
An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).
On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
Discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images to guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the guest RAM contents will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
Yes, that's the decision we have to make.
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + outer cache framework for arm64
This would be __flush_dcache_all() + outer cache flush all.
We need to be careful here since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this).
So I think we are left to the range operation where the DC ops to PoC would be enough for your L3.
If __flush_dcache_all() is *local* to a CPU then I guess DC ops to PoC by range would be the only option.
An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).
Yes, cpuidle/suspend would definitely require outer cache maintenance.
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
-- Catalin
--Anup
On Fri, Aug 30, 2013 at 11:44:30AM +0100, Anup Patel wrote:
On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
Discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images to guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the guest RAM contents will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
Yes, that's the decision we have to make.
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + outer cache framework for arm64
This would be __flush_dcache_all() + outer cache flush all.
We need to be careful here since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this).
So I think we are left to the range operation where the DC ops to PoC would be enough for your L3.
If __flush_dcache_all() is *local* to a CPU then I guess DC ops to PoC by range would be the only option.
Yes. In the (upcoming) ARM ARMv8 there is a clear note that set/way operations to flush the whole cache must not be used for the maintenance of large buffer but only during power-down/power-up code sequences.
An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).
Yes, cpuidle/suspend would definitely require outer cache maintenance.
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
That's for the KVM guys to decide ;)
On 2013-08-30 11:44, Anup Patel wrote:
On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
Discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images to guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the guest RAM contents will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
ok, I'm with you that far.
But is it also not true that we need to decide between:
A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in
Yes, that's the decision we have to make.
And (independently):
B.1: Use __flush_dcache_range
B.2: Use something else + outer cache framework for arm64
This would be __flush_dcache_all() + outer cache flush all.
We need to be careful here since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this).
So I think we are left to the range operation where the DC ops to PoC would be enough for your L3.
If __flush_dcache_all() is *local* to a CPU then I guess DC ops to PoC by range would be the only option.
An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).
Yes, cpuidle/suspend would definitely require outer cache maintenance.
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
M.
On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
On 2013-08-30 11:44, Anup Patel wrote:
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
I would think so, it probably cares about how the transactions are presented at the bus level by the CPU.
On 2013-08-30 15:04, Catalin Marinas wrote:
On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
On 2013-08-30 11:44, Anup Patel wrote:
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
I would think so, it probably cares about how the transactions are presented at the bus level by the CPU.
That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache.
Mumble mumble...
M.
On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
On 2013-08-30 15:04, Catalin Marinas wrote:
On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
On 2013-08-30 11:44, Anup Patel wrote:
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
I would think so, it probably cares about how the transactions are presented at the bus level by the CPU.
That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache.
<horrible hack> Could you use a dummy ATS_* operation, then read the cacheability attributes out of the PAR? </horrible hack>
Will
Hi Will,
On Fri, Aug 30, 2013 at 8:00 PM, Will Deacon will.deacon@arm.com wrote:
On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
On 2013-08-30 15:04, Catalin Marinas wrote:
On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
On 2013-08-30 11:44, Anup Patel wrote:
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
I would think so, it probably cares about how the transactions are presented at the bus level by the CPU.
That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache.
<horrible hack> Could you use a dummy ATS_* operation, then read the cacheability attributes out of the PAR? </horrible hack>
We will have to use an ATS12xxx operation, which can only be executed from EL2 or EL3. The host kernel runs at EL1, hence we will need to use an HVC call to execute the ATS12xxx operation in EL2 mode, which in turn means a world-switch overhead just to execute the ATS12xxx operation.
Will
--Anup
Hi Will,
On 2013-08-30 15:30, Will Deacon wrote:
On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
On 2013-08-30 15:04, Catalin Marinas wrote:
On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
On 2013-08-30 11:44, Anup Patel wrote:
For KVM, we can avoid flushing d-cache to PoC every time in coherent_icache_guest_page() by only doing it when Guest MMU is turned-off. This may reduce the performance penalty.
What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
I would think so, it probably cares about how the transactions are presented at the bus level by the CPU.
That'd make sense indeed. So we need to track both SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache.
<horrible hack> Could you use a dummy ATS_* operation, then read the cacheability attributes out of the PAR? </horrible hack>
That'd be doable, and maybe not that expensive (using ATS1, the Stage-1 translation is probably still cached after the translation fault). Actually, I suspect this is more correct than just looking at SCTLR_EL1, as it doesn't assume that the guest always uses cacheable memory when it enables the caches.
Shame HPFAR_EL2 doesn't report this information though...
M.
Hi Anup,
Jumping late into this thread (holidays).
On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
The approach of flushing the d-cache by set/way upon first run of a VCPU will not work because, for set/way operations, the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work up to the point of unification.
I don't understand where you got the idea that set/way operations work up to the point of unification. This is incorrect, the set/way operations work on the level of cache specified by bits 3:1 in the register passed to the DC CISW instruction. For your L3 cache, those bits would be 2 (and __flush_dcache_all() implementation does this dynamically).
For the I-cache all operation, that's correct, it only invalidates to the PoU but it doesn't need to go any further, you do the D-cache maintenance for this (no point in duplicating functionality).
Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at boot-up time, which does not work for us when the L3-cache is enabled, so there is no point in adding one more __flush_dcache_all() upon first run of a VCPU in KVM ARM64.
Do you mean the CPU is not aware of the L3 cache? Does CLIDR_EL1 report an L3 cache on your implementation?
It's not clear to me whether your L3 cache is inner or outer (or a mix). You say that D-cache maintenance to PoC flushes the L3, which looks to me like an inner cache, in which case it should be reported in the LoC bits in CLIDR_EL1.
IMHO, we are left with following options:
- Flush all RAM regions of VCPU using __flush_dcache_range() upon first run of VCPU
- Implement outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of VCPU
Do you have specific instructions for flushing the L3 cache only? It's not clear to me what an outer-cache framework would do on AArch64. It was added on AArch32 for the L2x0/PL310, which need separate instructions by physical address for flushing the cache. I really hope we don't get these again on ARMv8 hardware.
- Use an alternate version of flush_icache_range() which will flush the d-cache to the PoC instead of the PoU. We can also ensure that the coherent_icache_guest_page() function will be called upon Stage2 prefetch aborts only.
flush_icache_range() is meant to flush only to the PoU to ensure the I-D cache coherency. I don't think we should change this.
On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
Hi Anup,
Jumping late into this thread (holidays).
On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
The approach of flushing the d-cache by set/way upon first run of a VCPU will not work because, for set/way operations, the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work up to the point of unification.
I don't understand where you got the idea that set/way operations work up to the point of unification. This is incorrect, the set/way operations work on the level of cache specified by bits 3:1 in the register passed to the DC CISW instruction. For your L3 cache, those bits would be 2 (and __flush_dcache_all() implementation does this dynamically).
The L3-cache is not visible to CPU. It is totally independent and transparent to CPU.
For the I-cache all operation, that's correct, it only invalidates to the PoU but it doesn't need to go any further, you do the D-cache maintenance for this (no point in duplicating functionality).
Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at boot-up time, which does not work for us when the L3-cache is enabled, so there is no point in adding one more __flush_dcache_all() upon first run of a VCPU in KVM ARM64.
Do you mean the CPU is not aware of the L3 cache? Does CLIDR_EL1 report an L3 cache on your implementation?
Yes, CPU is not aware of L3-cache and its state.
In our case, CLIDR_EL1 does not report L3-cache.
It's not clear to me whether your L3 cache is inner or outer (or a mix). You say that D-cache maintenance to PoC flushes the L3 which looks to me like an inner cache, in which cache it should be reported in the LoC bits in CLIDR_EL1.
IMHO, we are left with following options:
- Flush all RAM regions of VCPU using __flush_dcache_range() upon first run of VCPU
- Implement outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of VCPU
Do you have specific instructions for flushing the L3 cache only? It's not clear to me what an outer-cache framework would to on AArch64. It was added on AArch32 for the L2x0/PL310 which need separate instructions by physical address for flushing the cache. I really hope we don't get these again on ARMv8 hardware.
- Use an alternate version of flush_icache_range() which will flush the d-cache to the PoC instead of the PoU. We can also ensure that the coherent_icache_guest_page() function will be called upon Stage2 prefetch aborts only.
flush_icache_range() is meant to flush only to the PoU to ensure the I-D cache coherency. I don't think we should change this.
We did not mean to change flush_icache_range(); instead, we suggest having a separate flush_icache_range_kvm() for KVM ARM64 which flushes the d-cache to the PoC.
-- Catalin
--Anup
On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
The approach of flushing the d-cache by set/way upon first run of a VCPU will not work because, for set/way operations, the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work up to the point of unification.
I don't understand where you got the idea that set/way operations work up to the point of unification. This is incorrect, the set/way operations work on the level of cache specified by bits 3:1 in the register passed to the DC CISW instruction. For your L3 cache, those bits would be 2 (and __flush_dcache_all() implementation does this dynamically).
The L3-cache is not visible to CPU. It is totally independent and transparent to CPU.
OK. But you say that operations like DC CIVAC actually flush the L3? So I don't see it as completely transparent to the CPU.
Do you have any configuration bits which would make the L3 completely transparent like always caching even when accesses are non-cacheable and DC ops to PoC ignoring it?
Now, back to the idea of outer_cache framework for arm64. Does your CPU have separate instructions for flushing this L3 cache?
On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
The approach of flushing the d-cache by set/way upon first run of a VCPU will not work because, for set/way operations, the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work up to the point of unification.
I don't understand where you got the idea that set/way operations work up to the point of unification. This is incorrect, the set/way operations work on the level of cache specified by bits 3:1 in the register passed to the DC CISW instruction. For your L3 cache, those bits would be 2 (and __flush_dcache_all() implementation does this dynamically).
The L3-cache is not visible to CPU. It is totally independent and transparent to CPU.
OK. But you say that operations like DC CIVAC actually flush the L3? So I don't see it as completely transparent to the CPU.
It is transparent from CPU perspective. In other words, there is nothing in CPU for controlling/monitoring L3-cache.
Do you have any configuration bits which would make the L3 completely transparent like always caching even when accesses are non-cacheable and DC ops to PoC ignoring it?
Actually, L3-cache monitors the types of read/write generated by CPU (i.e. whether the request is cacheable/non-cacheable or whether the request is due to DC ops to PoC, or ...).
To answer your query, there is no configuration to have L3 caching when accesses are non-cacheable and DC ops to PoC.
Now, back to the idea of outer_cache framework for arm64. Does your CPU have separate instructions for flushing this L3 cache?
No, the CPU does not have a separate instruction for flushing the L3-cache. On the other hand, the L3-cache has MMIO registers which can be used to explicitly flush the L3-cache.
-- Catalin
--Anup
On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote:
On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
The approach of flushing the d-cache by set/way upon first run of a VCPU will not work because, for set/way operations, the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work up to the point of unification.
I don't understand where you got the idea that set/way operations work up to the point of unification. This is incorrect, the set/way operations work on the level of cache specified by bits 3:1 in the register passed to the DC CISW instruction. For your L3 cache, those bits would be 2 (and __flush_dcache_all() implementation does this dynamically).
The L3-cache is not visible to CPU. It is totally independent and transparent to CPU.
OK. But you say that operations like DC CIVAC actually flush the L3? So I don't see it as completely transparent to the CPU.
It is transparent from CPU perspective. In other words, there is nothing in CPU for controlling/monitoring L3-cache.
We probably have a different understanding of "transparent". It doesn't look to me like any more transparent than the L1 or L2 cache. Basically, from a software perspective, it needs maintenance. Whether the CPU explicitly asks the L3 cache for this or the L3 cache figures it on its own based on the L1/L2 operations is irrelevant.
It would have been transparent if the software didn't need to know about it at all, but it's not the case.
Do you have any configuration bits which would make the L3 completely transparent like always caching even when accesses are non-cacheable and DC ops to PoC ignoring it?
Actually, L3-cache monitors the types of read/write generated by CPU (i.e. whether the request is cacheable/non-cacheable or whether the request is due to DC ops to PoC, or ...).
To answer your query, there is no configuration to have L3 caching when accesses are non-cacheable and DC ops to PoC.
So it's an outer cache with some "improvements" to handle DC ops to PoC. I think it was a pretty bad decision on the hardware side as we really try to get rid of outer caches for many reasons:
1. Non-standard cache flushing operations (MMIO-based)
2. It may require cache maintenance by physical address - something hard to get in a guest OS (unless you virtualise L3 cache maintenance)
3. Are barriers like DSB propagated correctly? Does a DC op to PoC followed by DSB ensure that the L3 drained the cachelines to RAM?
I think point 2 isn't required because your L3 detects DC ops to PoC. I hope point 3 is handled correctly (otherwise look how "nice" the mb() macro on arm is to cope with L2x0).
If only 1 is left, we don't need the full outer_cache framework but it still needs to be addressed since the assumption is that flush_cache_all (or __flush_dcache_all) flushes all cache levels. These are not used in generic code but are used during kernel booting, KVM and cpuidle drivers.
Now, back to the idea of outer_cache framework for arm64. Does your CPU have separate instructions for flushing this L3 cache?
No, the CPU does not have a separate instruction for flushing the L3-cache. On the other hand, the L3-cache has MMIO registers which can be used to explicitly flush the L3-cache.
I guess you use those in your firmware or boot loader since Linux requires clean/invalidated caches at boot (and I plan to push a patch which removes kernel D-cache cleaning during boot to spot such problems early). A cpuidle driver would probably need this as well.
On Fri, Aug 30, 2013 at 3:14 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote:
On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote:
On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
The approach of flushing the d-cache by set/way upon first run of a VCPU will not work because, for set/way operations, the ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work up to the point of unification.
I don't understand where you got the idea that set/way operations work up to the point of unification. This is incorrect, the set/way operations work on the level of cache specified by bits 3:1 in the register passed to the DC CISW instruction. For your L3 cache, those bits would be 2 (and __flush_dcache_all() implementation does this dynamically).
The L3-cache is not visible to CPU. It is totally independent and transparent to CPU.
OK. But you say that operations like DC CIVAC actually flush the L3? So I don't see it as completely transparent to the CPU.
It is transparent from CPU perspective. In other words, there is nothing in CPU for controlling/monitoring L3-cache.
We probably have a different understanding of "transparent". It doesn't look to me like any more transparent than the L1 or L2 cache. Basically, from a software perspective, it needs maintenance. Whether the CPU explicitly asks the L3 cache for this or the L3 cache figures it on its own based on the L1/L2 operations is irrelevant.
Yes, we have a different understanding regarding "transparent". The goal behind our L3 is to have an external cache such that the CPU is not aware of it and would work fine even if the L3 were turned off on-the-fly. Further, such an L3 has to be smarter than normal outer-caches because it will be maintained automatically by observing CPU operations.
It would have been transparent if the software didn't need to know about it at all, but it's not the case.
Currently, KVM is the only place where we need L3 maintenance, because the Guest starts with the MMU turned off; otherwise we don't need L3 maintenance in the kernel for any kind of IO or subsystem.
Do you have any configuration bits which would make the L3 completely transparent like always caching even when accesses are non-cacheable and DC ops to PoC ignoring it?
Actually, L3-cache monitors the types of read/write generated by CPU (i.e. whether the request is cacheable/non-cacheable or whether the request is due to DC ops to PoC, or ...).
To answer your query, there is no configuration to have L3 caching when accesses are non-cacheable and DC ops to PoC.
So it's an outer cache with some "improvements" to handle DC ops to PoC. I think it was a pretty bad decision on the hardware side as we really try to get rid of outer caches for many reasons:
Getting rid of outer caches (such as in this context) may make sense for embedded/low-end systems, but for servers an L3-cache makes a lot of sense.
Claiming this to be a bad decision all depends on what end application you are looking at.
- Non-standard cache flushing operations (MMIO-based)
- It may require cache maintenance by physical address - something hard to get in a guest OS (unless you virtualise L3 cache maintenance)
We don't have cache maintenance by physical address in our L3-cache.
- Are barriers like DSB propagated correctly? Does a DC op to PoC followed by DSB ensure that the L3 drained the cachelines to RAM?
Yes, DSB works perfectly fine for us with L3. Yes, DC op to PoC followed by DSB works fine with L3 too.
I think point 2 isn't required because your L3 detects DC ops to PoC. I hope point 3 is handled correctly (otherwise look how "nice" the mb() macro on arm is to cope with L2x0).
If only 1 is left, we don't need the full outer_cache framework but it still needs to be addressed since the assumption is that flush_cache_all (or __flush_dcache_all) flushes all cache levels. These are not used in generic code but are used during kernel booting, KVM and cpuidle drivers.
Now, back to the idea of outer_cache framework for arm64. Does your CPU have separate instructions for flushing this L3 cache?
No, the CPU does not have a separate instruction for flushing the L3-cache. On the other hand, the L3-cache has MMIO registers which can be used to explicitly flush it.
I guess you use those in your firmware or boot loader since Linux requires clean/invalidated caches at boot (and I plan to push a patch which removes kernel D-cache cleaning during boot to spot such problems early). A cpuidle driver would probably need this as well.
Yes, cpuidle would be another place where we may need L3-cache maintenance. In other words, if we need to power-off L3-cache from kernel then we need to first flush it.
-- Catalin
--Anup
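The MMIO-driven flush that Anup describes could look roughly like the sketch below. Every register name, offset, and bit meaning here is an assumption invented for illustration; the real controller interface for this L3 cache is not described in the thread:

```c
#include <stdint.h>

/*
 * Hypothetical sketch of a full L3 flush driven through MMIO registers.
 * L3C_FLUSH_CTRL and L3C_FLUSH_STAT are made-up names and offsets --
 * assumptions, not the actual hardware interface.
 */
#define L3C_FLUSH_CTRL 0x10u /* write 1 to start a full clean+invalidate */
#define L3C_FLUSH_STAT 0x14u /* bit 0 reads as 1 while a flush is running */

static inline void l3c_write(volatile void *base, uint32_t off, uint32_t val)
{
    *(volatile uint32_t *)((volatile char *)base + off) = val;
}

static inline uint32_t l3c_read(volatile void *base, uint32_t off)
{
    return *(volatile uint32_t *)((volatile char *)base + off);
}

static void l3c_flush_all(volatile void *base)
{
    l3c_write(base, L3C_FLUSH_CTRL, 1);
    while (l3c_read(base, L3C_FLUSH_STAT) & 1)
        ; /* poll until the controller reports completion */
    /* on arm64, a DSB here would order the flush against later accesses */
}
```

A cpuidle driver powering the L3 down, or firmware preparing for an MMU-off transition, would call something like l3c_flush_all() before cutting power, which is exactly the maintenance burden discussed above.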
On Fri, Aug 30, 2013 at 11:36:30AM +0100, Anup Patel wrote:
On Fri, Aug 30, 2013 at 3:14 PM, Catalin Marinas catalin.marinas@arm.com wrote:
On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote:
Actually, L3-cache monitors the types of read/write generated by CPU (i.e. whether the request is cacheable/non-cacheable or whether the request is due to DC ops to PoC, or ...).
To answer your query, there is no configuration to have L3 caching when accesses are non-cacheable and DC ops to PoC.
So it's an outer cache with some "improvements" to handle DC ops to PoC. I think it was a pretty bad decision on the hardware side as we really try to get rid of outer caches for many reasons:
Getting rid of outer caches (such as in this context) may make sense for embedded/low-end systems, but for servers an L3-cache makes a lot of sense.
Claiming this to be a bad decision all depends on what end application you are looking at.
It's not a bad decision to have big L3 cache, that's a very *good* decision for server applications. The bad part is that it is not fully integrated with the CPU (like allowing set/way operations to flush this cache level).
- Non-standard cache flushing operations (MMIO-based)
- It may require cache maintenance by physical address - something hard to get in a guest OS (unless you virtualise L3 cache maintenance)
We don't have cache maintenance by physical address in our L3-cache.
Great.
- Are barriers like DSB propagated correctly? Does a DC op to PoC followed by DSB ensure that the L3 drained the cachelines to RAM?
Yes, DSB works perfectly fine for us with L3. Yes, DC op to PoC followed by DSB works fine with L3 too.
Even better.
See my other comment on flush_dcache_all() use in the kernel/KVM and why I don't think it is suitable (which leaves us with DC ops to PoC in KVM).
I think point 2 isn't required because your L3 detects DC ops to PoC. I hope point 3 is handled correctly (otherwise look how "nice" the mb() macro on arm is to cope with L2x0).
If only 1 is left, we don't need the full outer_cache framework but it still needs to be addressed since the assumption is that flush_cache_all (or __flush_dcache_all) flushes all cache levels. These are not used in generic code but are used during kernel booting, KVM and cpuidle drivers.
Now, back to the idea of outer_cache framework for arm64. Does your CPU have separate instructions for flushing this L3 cache?
No, the CPU does not have a separate instruction for flushing the L3-cache. On the other hand, the L3-cache has MMIO registers which can be used to explicitly flush it.
I guess you use those in your firmware or boot loader since Linux requires clean/invalidated caches at boot (and I plan to push a patch which removes kernel D-cache cleaning during boot to spot such problems early). A cpuidle driver would probably need this as well.
Yes, cpuidle would be another place where we may need L3-cache maintenance. In other words, if we need to power-off L3-cache from kernel then we need to first flush it.
The problem is if you need to disable the MMU in the kernel, you would need to flush the L3 cache first. Normally this would be done in the firmware with PSCI but most likely not the case for the Applied hardware.
On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote:
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall christoffer.dall@linaro.org wrote:
On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
On 2013-08-15 16:13, Anup Patel wrote:
Hi Marc,
On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier marc.zyngier@arm.com wrote:
On 2013-08-15 14:31, Anup Patel wrote:
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier marc.zyngier@arm.com wrote:
> On 2013-08-15 07:26, Anup Patel wrote:
>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier marc.zyngier@arm.com wrote:
>>> On 2013-08-14 15:22, Anup Patel wrote:
>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>>>> [patch description and diff quoted in full at the top of the thread]
>>>>>
>>>>> [adding Will to the discussion as we talked about this in the past]
>>>>>
>>>>> That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
>>>>>
>>>>> Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
>>>>
>>>> It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
>>>>
>>>>> The alternative is do it in the kernel before running any vcpu - but that's not very nice either (have to clean the whole of the guest memory, which makes a full dcache clean more appealing).
>>>>
>>>> Actually, cleaning the full d-cache by set/way upon first run of a VCPU was our second option, but the current approach seemed very simple, hence we went for this.
>>>>
>>>> If more people vote for a full d-cache clean upon first run of a VCPU then we should revise this patch.
>>>
>>> Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
>>>
>>> As far as I can see, it should fix it without any additional flushing. Please let me know how it goes.
>>
>> HCR_EL2.DC=1 means all memory accesses with the Stage-1 MMU off are treated as "Normal Non-shareable, Inner Write-Back Write-Allocate, Outer Write-Back Write-Allocate memory".
>>
>> HCR_EL2.DC=0 means all memory accesses with the Stage-1 MMU off are treated as "Strongly-ordered device memory".
>>
>> Now if the Guest/VM accesses hardware MMIO devices directly (such as the VGIC CPU interface) with the MMU off, then MMIO devices will be accessed as normal memory when HCR_EL2.DC=1.
>
> I don't think so. Stage-2 still applies, and should force MMIO to be accessed as device memory.
>
>> HCR_EL2.DC=1 makes sense only if we have all software-emulated devices for the Guest/VM, which is not true for KVM ARM or KVM ARM64 because we use the VGIC.
>>
>> IMHO, this patch enforces incorrect memory attributes for the Guest/VM when the Stage-1 MMU is off.
>
> See above. My understanding is that HCR.DC controls the default output of Stage-1, and Stage-2 overrides still apply.
You had mentioned that PAGE_S2_DEVICE attribute was redundant and wanted guest to decide the memory attribute. In other words, you did not want to enforce any memory attribute in Stage2.
Please refer to this patch https://patchwork.kernel.org/patch/2543201/
This patch has never been merged. If you carry on following the discussion, you will certainly notice it was dropped for a very good reason: https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
So Stage-2 memory attributes are used, they are not going away, and they are essential to the patch I sent this morning.
M.
-- Fast, cheap, reliable. Pick two.
HCR_EL2.DC=1 will break the Guest/VM bootloader/firmware if the Guest/VM is given a DMA-capable device in pass-through mode. The reason being that bootloaders/firmware typically don't enable the MMU, and such a bootloader/firmware will programme a pass-through DMA-capable device without any flushes to guest RAM (because it has not enabled the MMU).
A good example of such a device would be a SATA AHCI controller given to a Guest/VM as direct access (using the SystemMMU), with the Guest bootloader/firmware accessing this SATA AHCI controller to load kernel images from a SATA disk. In this situation, we would have to hack the Guest bootloader/firmware AHCI driver to do explicit flushes to Guest RAM (because we have HCR_EL2.DC=1).
OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of curiosity: is that a made up example or something you actually have?
Back to square one: Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?
Eh, why is this a more valid argument than the vgic? The device passthrough Stage-2 mappings would still have the Stage-2 memory attributes to configure that memory region as device memory. Why is it relevant if the device is DMA-capable in this context?
Most ARM bootloaders/firmware run with the MMU off; hence, they will not do explicit cache flushes when programming a DMA-capable device. Now if HCR_EL2.DC=1, then Guest RAM updates done by the bootloader/firmware will not be visible to the DMA-capable device.
Read my question again: The bootloaders running with the MMU off in a VM will only disable the MMU for Stage-1 translations. Stage-2 translations are still enforced using the Stage-2 page tables, and their attributes for all mappings to devices will still enforce strongly-ordered device-type memory.
Again, what does it matter if the device is DMA-capable or not?
-Christoffer
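For context on the per-page cleaning being debated, a range clean like the __flush_dcache_area() call in the original patch walks the region one cache line at a time. In the sketch below the maintenance instruction is abstracted behind a callback so the loop stays portable; on arm64 the op would be a "dc civac" on each line address, with the whole walk followed by a DSB:

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Clean a virtual address range one cache line at a time, in the style
 * of __flush_dcache_area(). The actual maintenance instruction is
 * abstracted behind op() so the iteration logic is testable anywhere;
 * in the kernel, op() issues "dc civac" and a dsb follows the loop.
 */
static void clean_dcache_range(uintptr_t start, size_t size,
                               size_t line_size,
                               void (*op)(uintptr_t line))
{
    uintptr_t addr = start & ~((uintptr_t)line_size - 1); /* align down */
    uintptr_t end = start + size;

    for (; addr < end; addr += line_size)
        op(addr);
}
```

Doing this for every faulted-in guest page is precisely the per-page cost Marc calls overkill, which is why the alternatives (a one-time full clean, or userspace cleaning only the touched regions) are on the table.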
On 16 August 2013 22:44, Christoffer Dall christoffer.dall@linaro.org wrote:
> Read my question again: The bootloaders running with the MMU off in a VM will only disable the MMU for Stage-1 translations. Stage-2 translations are still enforced using the Stage-2 page tables, and their attributes for all mappings to devices will still enforce strongly-ordered device-type memory.
Please read my reply again. Also try to read up on the SATA AHCI spec.
> Again, what does it matter if the device is DMA-capable or not?
> -Christoffer
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
On Fri, Aug 16, 2013 at 10:48:39PM +0530, Anup Patel wrote:
> Please read my reply again. Also try to read up on the SATA AHCI spec.
Can you please explain how the specs relate to my question?
The only case I can see is if the guest puts data in a DMA region that it expects the device to read. Is this the case?
Just so we're clear, this is quite different from programming the device through the device memory.
-Christoffer
https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>> >> So Stage-2 memory attributes are used, they are not going away, and >> they are >> essential to the patch I sent this morning. >> >> >> M. >> -- >> Fast, cheap, reliable. Pick two. > > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > is > provided a DMA-capable device in pass-through mode. The reason being > bootloader/firmware typically don't enable MMU and such > bootloader/firmware > will programme a pass-through DMA-capable device without any flushes > to > guest RAM (because it has not enabled MMU). > > A good example of such a device would be SATA AHCI controller given > to a > Guest/VM as direct access (using SystemMMU) and Guest > bootloader/firmware > accessing this SATA AHCI controller to load kernel images from SATA > disk. > In this situation, we will have to hack Guest bootloader/firmware > AHCI driver to > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1).
OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of curiosity: is that a made up example or something you actually have?
Back to square one: Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?
Eh, why is this a more valid argument than the vgic? The device passthrough Stage-2 mappings would still have the Stage-2 memory attributes to configure that memory region as device memory. Why is it relevant if the device is DMA-capable in this context?
Most ARM bootloaders/firmware run with the MMU off; hence they will not do explicit cache flushes when programming a DMA-capable device. Now if HCR_EL2.DC=1, then Guest RAM updates done by the bootloader/firmware will not be visible to the DMA-capable device.
Read my question again: The bootloaders running with the MMU off in a VM will only disable the MMU for Stage-1 translations. Stage-2 translations are still enforced using the Stage-2 page tables, and their attributes for all mappings to devices will still enforce strongly-ordered device-type memory.
Please read my reply again. Also try to read up on the SATA AHCI spec.
Can you please explain how the specs relate to my question?
The only case I can see is if the guest puts data in a DMA region that it expects the device to read. Is this the case?
Just so we're clear, this is quite different from programming the device through the device memory.
ok, I see from the specs. It's not just DMA-capable, but actually DMA-programmable. In that case setting DC=1 would be problematic.
Thanks for being so helpful in explaining this ;)
-Christoffer
Hi Marc,
On 15 August 2013 10:22, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Anup,
On 2013-08-14 15:22, Anup Patel wrote:
On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier marc.zyngier@arm.com wrote:
Hi Pranav,
On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
Systems with large external L3-cache (few MBs), might have dirty content belonging to the guest page in L3-cache. To tackle this, we need to flush such dirty content from d-cache so that guest will see correct contents of guest page when guest MMU is disabled.
The patch fixes coherent_icache_guest_page() for external L3-cache.
Signed-off-by: Pranavkumar Sawargaonkar pranavkumar@linaro.org Signed-off-by: Anup Patel anup.patel@linaro.org
 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
[adding Will to the discussion as we talked about this in the past]
That's definitely an issue, but I'm not sure the fix is to hit the data cache on each page mapping. It looks overkill.
Wouldn't it be enough to let userspace do the cache cleaning? kvmtools knows which bits of the guest memory have been touched, and can do a "DC DVAC" on this region.
It seems a bit unnatural to have cache cleaning in user-space. I am sure other architectures don't have such cache cleaning in user-space for KVM.
The alternative is to do it in the kernel before running any vcpu - but that's not very nice either (we would have to clean the whole of the guest memory, which makes a full dcache clean more appealing).
Actually, cleaning the full d-cache by set/way upon first run of a VCPU was our second option, but the current approach seemed very simple, hence we went for it.
If more people vote for full d-cache clean upon first run of VCPU then we should revise this patch.
Can you please give the attached patch a spin on your HW? I've boot-tested it on a model, but of course I can't really verify that it fixes your issue.
As far as I can see, it should fix it without any additional flushing.
Please let me know how it goes.
Thanks,
M.
-- Fast, cheap, reliable. Pick two.
kvmarm mailing list kvmarm@lists.cs.columbia.edu https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
Surely we will try this patch.
Just to add a few more points to this discussion:

Actually, we are already flushing the d-cache in "coherent_icache_guest_page" by calling flush_icache_range.

Now, in my patch I am doing the same thing one more time by calling __flush_dcache_area just below flush_icache_range.

The main difference between the d-cache flushing in the above two routines is: flush_icache_range flushes the d-cache to the point of unification (dc cvau), whereas __flush_dcache_area flushes it to the point of coherency (dc civac).

Since the second routine flushes to the point of coherency, it makes the L3 cache coherent, syncing the changes with immediate effect and solving the issue.
Thanks, Pranav
Hi Pranav,
On 2013-08-15 09:39, Pranavkumar Sawargaonkar wrote:
[...]
Just to add a few more points to this discussion:

Actually, we are already flushing the d-cache in "coherent_icache_guest_page" by calling flush_icache_range.

Now, in my patch I am doing the same thing one more time by calling __flush_dcache_area just below flush_icache_range.

The main difference between the d-cache flushing in the above two routines is: flush_icache_range flushes the d-cache to the point of unification (dc cvau), whereas __flush_dcache_area flushes it to the point of coherency (dc civac).

Since the second routine flushes to the point of coherency, it makes the L3 cache coherent, syncing the changes with immediate effect and solving the issue.
I understand that. What I object to is that always cleaning to PoC is expensive (both in terms of latency and energy), and doing so for each page, long after the MMU has been enabled feels like a great loss of performance.
My gut feeling is that a single clean operation at startup time (whether it is from the kernel or userspace) would be a lot better.
Thanks,
M.
On 14/08/13 12:47, Pranavkumar Sawargaonkar wrote:
Systems with large external L3-cache (few MBs), might have dirty content belonging to the guest page in L3-cache. To tackle this, we need to flush such dirty content from d-cache so that guest will see correct contents of guest page when guest MMU is disabled.
The patch fixes coherent_icache_guest_page() for external L3-cache.
I don't understand KVM but the commit message caught my attention.
You are referring to cleaning the external/outer L3 cache but using internal cache maintenance APIs. Shouldn't you be using the outer_cache APIs if you really intend to do outer cache maintenance? Or is the external/outer L3 cache not unified?
Or did you mean to flush d-cache to the point of coherency(external L3 in your case)? If so it's not clear from the commit log.
Regards, Sudeep
Signed-off-by: Pranavkumar Sawargaonkar pranavkumar@linaro.org Signed-off-by: Anup Patel anup.patel@linaro.org
 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
On Wed, Aug 14, 2013 at 7:53 PM, Sudeep KarkadaNagesha Sudeep.KarkadaNagesha@arm.com wrote:
On 14/08/13 12:47, Pranavkumar Sawargaonkar wrote:
Systems with large external L3-cache (few MBs), might have dirty content belonging to the guest page in L3-cache. To tackle this, we need to flush such dirty content from d-cache so that guest will see correct contents of guest page when guest MMU is disabled.
The patch fixes coherent_icache_guest_page() for external L3-cache.
I don't understand KVM but the commit message caught my attention.
You are referring to clean the external/outer L3 cache but using internal cache maintenance APIs. You should be using outer_cache APIs if you really intend to do outer cache maintenance ? Is the external/outer L3 cache not unified ?
Or did you mean to flush d-cache to the point of coherency(external L3 in your case)? If so it's not clear from the commit log.
Yes, we mean to flush d-cache to the point of coherency (which in our case is external L3-cache).
The patch description is more in terms of KVM virtualization, but we can certainly add more details.
Regards, Sudeep
Signed-off-by: Pranavkumar Sawargaonkar pranavkumar@linaro.org Signed-off-by: Anup Patel anup.patel@linaro.org
 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
linux-arm-kernel mailing list linux-arm-kernel@lists.infradead.org http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Regards, Anup
On Wed, Aug 14, 2013 at 05:17:12PM +0530, Pranavkumar Sawargaonkar wrote:
Systems with large external L3-cache (few MBs), might have dirty content belonging to the guest page in L3-cache. To tackle this, we need to flush such dirty content from d-cache so that guest will see correct contents of guest page when guest MMU is disabled.
The patch fixes coherent_icache_guest_page() for external L3-cache.
Signed-off-by: Pranavkumar Sawargaonkar pranavkumar@linaro.org Signed-off-by: Anup Patel anup.patel@linaro.org
 arch/arm64/include/asm/kvm_mmu.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index efe609c..5129038 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
 	if (!icache_is_aliasing()) {	/* PIPT */
 		unsigned long hva = gfn_to_hva(kvm, gfn);
 		flush_icache_range(hva, hva + PAGE_SIZE);
+		/* Flush d-cache for systems with external caches. */
This comment is nowhere near explanatory enough for someone who comes by later and tries to figure out why we're flushing the dcache on every page we swap in under given circumstances. Yes, you can do git blame, until you modify the line for some other reason and it just becomes a pain.
+		__flush_dcache_area((void *) hva, PAGE_SIZE);
eh, why is this only relevant for a non-aliasing icache? In fact, this does not seem to be icache related at all, but rather instruction-stream to dcache related, which would warrant either a rename of the function to something more generic or a separate function.
 	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
 		/* any kind of VIPT cache */
 		__flush_icache_all();
--
1.7.9.5