Hi!
I would like to publish two debug features which were needed for other work of mine.
One is the reworked lx-symbols script, which now actually works on at least gdb 9.1 (gdb 9.2 was reported to fail to load the debug symbols from the kernel, for reasons unrelated to this patch) and upstream qemu.
The other feature is the ability to trap all guest exceptions (on SVM for now) and see them in kvmtrace before they are potentially merged into a double/triple fault.
This can be very useful, and I have already had to patch KVM manually for it a few times. Once time permits, I will implement this feature on Intel as well.
V2:
* Some more refactoring and workarounds for the lx-symbols script
* Added the KVM_GUESTDBG_BLOCKIRQ flag to enable 'block interrupts on single step', together with the KVM_CAP_SET_GUEST_DEBUG2 capability to indicate which guest debug flags are supported (a capability-check sketch follows the changelog below).
This replaces the unconditional blocking of interrupts on single step that was done in the previous version of this patch set. Patches for qemu to use this feature will be sent soon.
* Reworked the 'intercept all exceptions for debug' feature according to the review feedback:
- renamed the parameter that enables the feature and moved it to the common kvm module (only the SVM part is currently implemented, though)
- disabled the feature for SEV guests, as was suggested during the review
- made the vmexit table const again, as was also suggested in the review
V3:
* Modified a selftest to cover the KVM_GUESTDBG_BLOCKIRQ flag
* Rebased on kvm/queue
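For reference, KVM_CAP_SET_GUEST_DEBUG2 is queried through KVM_CHECK_EXTENSION and returns the set of supported KVM_GUESTDBG_* flags, so userspace can probe for KVM_GUESTDBG_BLOCKIRQ before using it. A minimal sketch of such a probe (the helper name is illustrative, and it assumes uapi headers that already contain the new definitions from this series):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Return non-zero if the host KVM advertises KVM_GUESTDBG_BLOCKIRQ. */
static int have_guestdbg_blockirq(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int flags, ret = 0;

	if (kvm_fd < 0)
		return 0;

	/* KVM_CAP_SET_GUEST_DEBUG2 reports the supported KVM_GUESTDBG_* flags. */
	flags = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG2);
	if (flags > 0 && (flags & KVM_GUESTDBG_BLOCKIRQ))
		ret = 1;

	close(kvm_fd);
	return ret;
}

A userspace consumer such as qemu would presumably perform the same check before setting the flag.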
Best regards, Maxim Levitsky
Maxim Levitsky (6):
  KVM: SVM: split svm_handle_invalid_exit
  KVM: x86: add force_intercept_exceptions_mask
  KVM: SVM: implement force_intercept_exceptions_mask
  scripts/gdb: rework lx-symbols gdb script
  KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ
  KVM: selftests: test KVM_GUESTDBG_BLOCKIRQ
 Documentation/virt/kvm/api.rst                  |   1 +
 arch/x86/include/asm/kvm_host.h                 |   5 +-
 arch/x86/include/uapi/asm/kvm.h                 |   1 +
 arch/x86/kvm/svm/svm.c                          |  87 +++++++-
 arch/x86/kvm/svm/svm.h                          |   6 +-
 arch/x86/kvm/x86.c                              |  12 +-
 arch/x86/kvm/x86.h                              |   2 +
 kernel/module.c                                 |   8 +-
 scripts/gdb/linux/symbols.py                    | 203 ++++++++++++------
 .../testing/selftests/kvm/x86_64/debug_regs.c   |  24 ++-
 10 files changed, 266 insertions(+), 83 deletions(-)
Split the check for having a vmexit handler out into svm_check_exit_valid, and make svm_handle_invalid_exit only handle a vmexit that is already known to be invalid.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 arch/x86/kvm/svm/svm.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 7b58e445a967..e45259177009 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3236,12 +3236,14 @@ static void dump_vmcb(struct kvm_vcpu *vcpu)
 	       "excp_to:", save->last_excp_to);
 }
 
-static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code)
+static bool svm_check_exit_valid(struct kvm_vcpu *vcpu, u64 exit_code)
 {
-	if (exit_code < ARRAY_SIZE(svm_exit_handlers) &&
-	    svm_exit_handlers[exit_code])
-		return 0;
+	return (exit_code < ARRAY_SIZE(svm_exit_handlers) &&
+		svm_exit_handlers[exit_code]);
+}
 
+static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code)
+{
 	vcpu_unimpl(vcpu, "svm: unexpected exit reason 0x%llx\n", exit_code);
 	dump_vmcb(vcpu);
 	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
@@ -3249,14 +3251,13 @@ static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code)
 	vcpu->run->internal.ndata = 2;
 	vcpu->run->internal.data[0] = exit_code;
 	vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
-
-	return -EINVAL;
+	return 0;
 }
 
 int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
 {
-	if (svm_handle_invalid_exit(vcpu, exit_code))
-		return 0;
+	if (!svm_check_exit_valid(vcpu, exit_code))
+		return svm_handle_invalid_exit(vcpu, exit_code);
 
 #ifdef CONFIG_RETPOLINE
 	if (exit_code == SVM_EXIT_MSR)
This parameter will be used by VMX and SVM code to force interception of a set of exceptions, given by a bitmask for guest debug and/or kvm debug.
This is based on an idea first shown here: https://patchwork.kernel.org/project/kvm/patch/20160301192822.GD22677@pd.tni...
CC: Borislav Petkov bp@suse.de
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 arch/x86/kvm/x86.c | 3 +++
 arch/x86/kvm/x86.h | 2 ++
 2 files changed, 5 insertions(+)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fdc0c18339fb..092e2fad3c0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -184,6 +184,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+uint force_intercept_exceptions_mask;
+module_param(force_intercept_exceptions_mask, uint, S_IRUGO | S_IWUSR);
+EXPORT_SYMBOL_GPL(force_intercept_exceptions_mask);
 /*
  * Restoring the host value for MSRs that are only consumed when running in
  * usermode, e.g. SYSCALL MSRs and TSC_AUX, can be deferred until the CPU
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 7d66d63dc55a..2b59825213c7 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -347,6 +347,8 @@ extern struct static_key kvm_no_apic_vcpu;
 
 extern bool report_ignored_msrs;
 
+extern uint force_intercept_exceptions_mask;
+
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
 	return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
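As an illustration of the parameter's format (this snippet is not part of the patch; the helper and the example value are made up here), bit N of the mask stands for exception vector N, so a value such as 0x6040 asks for #UD, #GP and #PF to be intercepted:

#include <linux/bits.h>
#include <linux/types.h>
#include <asm/trapnr.h>	/* X86_TRAP_UD, X86_TRAP_GP, X86_TRAP_PF, ... */

/*
 * Illustrative only: a mask asking KVM to force-intercept #UD, #GP and
 * #PF, i.e. BIT(6) | BIT(13) | BIT(14) == 0x6040.  The in-tree KVM code
 * uses the equivalent *_VECTOR constants.
 */
static const uint example_force_intercept_mask =
	BIT(X86_TRAP_UD) | BIT(X86_TRAP_GP) | BIT(X86_TRAP_PF);

/* Illustrative helper: would 'vector' be force-intercepted by 'mask'? */
static inline bool exception_is_force_intercepted(uint mask, unsigned int vector)
{
	return vector < 32 && (mask & BIT(vector));
}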
Assuming this hasn't been abandoned...
On Wed, Aug 11, 2021, Maxim Levitsky wrote:
This parameter will be used by VMX and SVM code to force interception of a set of exceptions, given by a bitmask for guest debug and/or kvm debug.
This is based on an idea first shown here: https://patchwork.kernel.org/project/kvm/patch/20160301192822.GD22677@pd.tni...
CC: Borislav Petkov bp@suse.de Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
 arch/x86/kvm/x86.c | 3 +++
 arch/x86/kvm/x86.h | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fdc0c18339fb..092e2fad3c0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -184,6 +184,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+uint force_intercept_exceptions_mask;
+module_param(force_intercept_exceptions_mask, uint, S_IRUGO | S_IWUSR);
Use octal permissions. This also can't be a simple writable param, at least not without a well-documented disclaimer, as there's no guarantee a vCPU will update its exception bitmap in a timely fashion. An alternative to a module param would be to extend/add a per-VM ioctl(), e.g. maybe KVM_SET_GUEST_DEBUG? The downside of an ioctl() is that it would require userspace enabling :-/
On Thu, 2021-09-02 at 16:56 +0000, Sean Christopherson wrote:
Assuming this hasn't been abandoned...
On Wed, Aug 11, 2021, Maxim Levitsky wrote:
This parameter will be used by VMX and SVM code to force interception of a set of exceptions, given by a bitmask for guest debug and/or kvm debug.
This is based on an idea first shown here: https://patchwork.kernel.org/project/kvm/patch/20160301192822.GD22677@pd.tni...
CC: Borislav Petkov bp@suse.de Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
 arch/x86/kvm/x86.c | 3 +++
 arch/x86/kvm/x86.h | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fdc0c18339fb..092e2fad3c0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -184,6 +184,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+uint force_intercept_exceptions_mask;
+module_param(force_intercept_exceptions_mask, uint, S_IRUGO | S_IWUSR);
Use octal permissions. This also can't be a simple writable param, at least not without a well-documented disclaimer, as there's no guarantee a vCPU will update its exception bitmap in a timely fashion. An alternative to a module param would be to extend/add a per-VM ioctl(), e.g. maybe KVM_SET_GUEST_DEBUG? The downside of an ioctl() is that it would require userspace enabling :-/
All other module params in this file use macros for permissions, which is why I used them too.
I'll add a comment with a disclaimer here - this is only for debug. I strongly don't want to have this as an ioctl, as that would indeed need qemu patches, not to mention things like unit tests, which don't even always use qemu.
Or I can make this parameter read-only. I don't mind reloading the kvm module when I change it.
Best regards, Maxim Levitsky
PS: Forgot to send this mail, it was in my drafts folder.
On Tue, Feb 08, 2022, Maxim Levitsky wrote:
On Thu, 2021-09-02 at 16:56 +0000, Sean Christopherson wrote:
Assuming this hasn't been abandoned...
On Wed, Aug 11, 2021, Maxim Levitsky wrote:
This parameter will be used by VMX and SVM code to force interception of a set of exceptions, given by a bitmask for guest debug and/or kvm debug.
This is based on an idea first shown here: https://patchwork.kernel.org/project/kvm/patch/20160301192822.GD22677@pd.tni...
CC: Borislav Petkov bp@suse.de Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
 arch/x86/kvm/x86.c | 3 +++
 arch/x86/kvm/x86.h | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fdc0c18339fb..092e2fad3c0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -184,6 +184,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+uint force_intercept_exceptions_mask;
+module_param(force_intercept_exceptions_mask, uint, S_IRUGO | S_IWUSR);
Use octal permissions. This also can't be a simple writable param, at least not without a well-documented disclaimer, as there's no guarantee a vCPU will update its exception bitmap in a timely fashion. An alternative to a module param would be to extend/add a per-VM ioctl(), e.g. maybe KVM_SET_GUEST_DEBUG? The downside of an ioctl() is that it would require userspace enabling :-/
All other module params in this file use macros for permissions, which is why I used them too.
I'll add a comment with a disclaimer here - this is only for debug. I strongly don't want to have this as an ioctl, as that would indeed need qemu patches, not to mention things like unit tests, which don't even always use qemu.
Or I can make this parameter read-only. I don't mind reloading the kvm module when I change it.
Oh! We can force an update, a la nx_huge_pages, where the setter loops through all VMs and does a kvm_make_all_cpus_request() to instruct vCPUs to update their bitmaps. Requires a new request, but that doesn't seem like a huge deal, and it might help pave the way for adding more debug hooks for developers.
The param should also be "unsafe".
E.g. something like
static const struct kernel_param_ops force_ex_intercepts_ops = {
	.set = set_force_exception_intercepts,
	.get = get_force_exception_intercepts,
};
module_param_cb_unsafe(force_exception_intercepts, &force_ex_intercepts_ops,
		       &force_exception_intercepts, 0644);
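A rough sketch of the .set callback behind such a registration, following the nx_huge_pages pattern described above; the request name KVM_REQ_UPDATE_FORCED_EXCEPTIONS and both function names are placeholders for illustration, not code from any posted revision:

static int set_force_exception_intercepts(const char *val,
					  const struct kernel_param *kp)
{
	struct kvm *kvm;
	int err;

	/* Parse and store the new mask like a normal uint param. */
	err = param_set_uint(val, kp);
	if (err)
		return err;

	/* Kick every vCPU of every VM so it refreshes its exception bitmap. */
	mutex_lock(&kvm_lock);
	list_for_each_entry(kvm, &vm_list, vm_list)
		kvm_make_all_cpus_request(kvm, KVM_REQ_UPDATE_FORCED_EXCEPTIONS);
	mutex_unlock(&kvm_lock);

	return 0;
}

static int get_force_exception_intercepts(char *buffer,
					  const struct kernel_param *kp)
{
	return param_get_uint(buffer, kp);
}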
On Tue, 2022-03-08 at 23:37 +0000, Sean Christopherson wrote:
On Tue, Feb 08, 2022, Maxim Levitsky wrote:
On Thu, 2021-09-02 at 16:56 +0000, Sean Christopherson wrote:
Assuming this hasn't been abandoned...
On Wed, Aug 11, 2021, Maxim Levitsky wrote:
This parameter will be used by VMX and SVM code to force interception of a set of exceptions, given by a bitmask for guest debug and/or kvm debug.
This is based on an idea first shown here: https://patchwork.kernel.org/project/kvm/patch/20160301192822.GD22677@pd.tni...
CC: Borislav Petkov bp@suse.de Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
 arch/x86/kvm/x86.c | 3 +++
 arch/x86/kvm/x86.h | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fdc0c18339fb..092e2fad3c0d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -184,6 +184,9 @@ module_param(force_emulation_prefix, bool, S_IRUGO);
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
 
+uint force_intercept_exceptions_mask;
+module_param(force_intercept_exceptions_mask, uint, S_IRUGO | S_IWUSR);
Use octal permissions. This also can't be a simple writable param, at least not without a well-documented disclaimer, as there's no guarantee a vCPU will update its exception bitmap in a timely fashion. An alternative to a module param would be to extend/add a per-VM ioctl(), e.g. maybe KVM_SET_GUEST_DEBUG? The downside of an ioctl() is that it would require userspace enabling :-/
All other module params in this file use macros for permissions, which is why I used them too.
I'll add a comment with a disclaimer here - this is only for debug. I strongly don't want to have this as an ioctl, as that would indeed need qemu patches, not to mention things like unit tests, which don't even always use qemu.
Or I can make this parameter read-only. I don't mind reloading the kvm module when I change it.
Oh! We can force an update, a la nx_huge_pages, where the setter loops through all VMs and does a kvm_make_all_cpus_request() to instruct vCPUs to update their bitmaps. Requires a new request, but that doesn't seem like a huge deal, and it might help pave the way for adding more debug hooks for developers.
Question: is it worth it? Since I am very busy with various things, this feature, being just a small debug aid which I use once in a while, doesn't get much time from me.
I can implement this in the future no doubt.
Best regards, Maxim Levitsky
The param should also be "unsafe".
I didn't know about unsafe parameters until recently. I can mark it as such, no objections.
Best regards, Maxim Levitsky
E.g. something like
static const struct kernel_param_ops force_ex_intercepts_ops = {
	.set = set_force_exception_intercepts,
	.get = get_force_exception_intercepts,
};
module_param_cb_unsafe(force_exception_intercepts, &force_ex_intercepts_ops,
		       &force_exception_intercepts, 0644);
On 3/9/22 13:31, Maxim Levitsky wrote:
Question: is it worth it? Since I am very busy with various things, this feature, being just a small debug aid which I use once in a while, doesn't get much time from me.
I agree it's not very much worth it.
Paolo
On Wed, Mar 09, 2022, Paolo Bonzini wrote:
On 3/9/22 13:31, Maxim Levitsky wrote:
Question: is it worth it? Since I am very busy with various things, this feature, being just a small debug aid which I use once in a while, doesn't get much time from me.
I agree it's not very much worth it.
I don't have a use case, was just trying to find the bottom of my inbox and came across this thread.
Currently #TS interception is only done once. Also exception interception is not enabled for SEV guests.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/svm/svm.c          | 70 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h          |  6 ++-
 arch/x86/kvm/x86.c              |  5 ++-
 4 files changed, 80 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 20daaf67a5bf..72fe03506018 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1690,6 +1690,8 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu); void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr); void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code); void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload); +void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr, + u32 error_code, unsigned long payload); void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr); void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code); void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e45259177009..19f54b07161a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -233,6 +233,8 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; #define MSRS_RANGE_SIZE 2048 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2)
+static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code); + u32 svm_msrpm_offset(u32 msr) { u32 offset; @@ -1153,6 +1155,22 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu, } }
+static void svm_init_force_exceptions_intercepts(struct vcpu_svm *svm) +{ + int exc; + + svm->force_intercept_exceptions_mask = force_intercept_exceptions_mask; + for (exc = 0 ; exc < 32 ; exc++) { + if (!(svm->force_intercept_exceptions_mask & (1 << exc))) + continue; + + /* Those are defined to have undefined behavior in the SVM spec */ + if (exc != 2 && exc != 9) + continue; + set_exception_intercept(svm, exc); + } +} + static void init_vmcb(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -1304,6 +1322,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
enable_gif(svm);
+ if (!sev_es_guest(vcpu->kvm)) + svm_init_force_exceptions_intercepts(svm); + }
static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) @@ -1892,6 +1913,17 @@ static int pf_interception(struct kvm_vcpu *vcpu) u64 fault_address = svm->vmcb->control.exit_info_2; u64 error_code = svm->vmcb->control.exit_info_1;
+ if ((svm->force_intercept_exceptions_mask & (1 << PF_VECTOR))) + if (npt_enabled && !vcpu->arch.apf.host_apf_flags) { + /* If the #PF was only intercepted for debug, inject + * it directly to the guest, since the kvm's mmu code + * is not ready to deal with such page faults. + */ + kvm_queue_exception_e_p(vcpu, PF_VECTOR, + error_code, fault_address); + return 1; + } + return kvm_handle_page_fault(vcpu, error_code, fault_address, static_cpu_has(X86_FEATURE_DECODEASSISTS) ? svm->vmcb->control.insn_bytes : NULL, @@ -1967,6 +1999,40 @@ static int ac_interception(struct kvm_vcpu *vcpu) return 1; }
+static int gen_exc_interception(struct kvm_vcpu *vcpu) +{ + /* + * Generic exception intercept handler which forwards a guest exception + * as-is to the guest. + * For exceptions that don't have a special intercept handler. + * + * Used only for 'force_intercept_exceptions_mask' KVM debug feature. + */ + struct vcpu_svm *svm = to_svm(vcpu); + int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE; + + /* SVM doesn't provide us with an error code for the #DF */ + u32 err_code = exc == DF_VECTOR ? 0 : svm->vmcb->control.exit_info_1; + + if (!(svm->force_intercept_exceptions_mask & (1 << exc))) + return svm_handle_invalid_exit(vcpu, svm->vmcb->control.exit_code); + + if (exc == TS_VECTOR) { + /* + * SVM doesn't provide us with an error code to be able to + * re-inject the #TS exception, so just disable its + * intercept, and let the guest re-execute the instruction. + */ + vmcb_clr_intercept(&svm->vmcb01.ptr->control, + INTERCEPT_EXCEPTION_OFFSET + TS_VECTOR); + recalc_intercepts(svm); + } else if (x86_exception_has_error_code(exc)) + kvm_queue_exception_e(vcpu, exc, err_code); + else + kvm_queue_exception(vcpu, exc); + return 1; +} + static bool is_erratum_383(void) { int err, i; @@ -3065,6 +3131,10 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_WRITE_DR5] = dr_interception, [SVM_EXIT_WRITE_DR6] = dr_interception, [SVM_EXIT_WRITE_DR7] = dr_interception, + + [SVM_EXIT_EXCP_BASE ... + SVM_EXIT_EXCP_BASE + 31] = gen_exc_interception, + [SVM_EXIT_EXCP_BASE + DB_VECTOR] = db_interception, [SVM_EXIT_EXCP_BASE + BP_VECTOR] = bp_interception, [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception, diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 524d943f3efc..187ada7c5b03 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -196,6 +196,7 @@ struct vcpu_svm { bool ghcb_sa_free;
bool guest_state_loaded; + u32 force_intercept_exceptions_mask; };
struct svm_cpu_data { @@ -351,8 +352,11 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) struct vmcb *vmcb = svm->vmcb01.ptr;
WARN_ON_ONCE(bit >= 32); - vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
+ if ((1 << bit) & svm->force_intercept_exceptions_mask) + return; + + vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); recalc_intercepts(svm); }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 092e2fad3c0d..e5c7b8fa1f7f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -695,12 +695,13 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, } EXPORT_SYMBOL_GPL(kvm_queue_exception_p);
-static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr, - u32 error_code, unsigned long payload) +void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr, + u32 error_code, unsigned long payload) { kvm_multiple_exception(vcpu, nr, true, error_code, true, payload, false); } +EXPORT_SYMBOL_GPL(kvm_queue_exception_e_p);
int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err) {
On Wed, 2021-08-11 at 15:29 +0300, Maxim Levitsky wrote:
Currently #TS interception is only done once. Also exception interception is not enabled for SEV guests.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
 arch/x86/include/asm/kvm_host.h |  2 +
 arch/x86/kvm/svm/svm.c          | 70 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.h          |  6 ++-
 arch/x86/kvm/x86.c              |  5 ++-
 4 files changed, 80 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 20daaf67a5bf..72fe03506018 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1690,6 +1690,8 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu); void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr); void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code); void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload); +void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
u32 error_code, unsigned long payload);
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr); void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code); void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e45259177009..19f54b07161a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -233,6 +233,8 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; #define MSRS_RANGE_SIZE 2048 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2) +static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code);
u32 svm_msrpm_offset(u32 msr) { u32 offset; @@ -1153,6 +1155,22 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu, } } +static void svm_init_force_exceptions_intercepts(struct vcpu_svm *svm) +{
- int exc;
- svm->force_intercept_exceptions_mask = force_intercept_exceptions_mask;
- for (exc = 0 ; exc < 32 ; exc++) {
if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
continue;
/* Those are defined to have undefined behavior in the SVM spec */
if (exc != 2 && exc != 9)
continue;
set_exception_intercept(svm, exc);
I made a mistake here, during one of the refactorings I think, after I had finished testing this thoroughly, and I noticed it only now while looking at the code again.
I attached a fix for this, and I also tested more carefully that the feature works, with selftests, kvm unit tests, and by booting a few VMs.
Best regards, Maxim Levitsky
- }
+}
static void init_vmcb(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -1304,6 +1322,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu) enable_gif(svm);
- if (!sev_es_guest(vcpu->kvm))
svm_init_force_exceptions_intercepts(svm);
} static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) @@ -1892,6 +1913,17 @@ static int pf_interception(struct kvm_vcpu *vcpu) u64 fault_address = svm->vmcb->control.exit_info_2; u64 error_code = svm->vmcb->control.exit_info_1;
- if ((svm->force_intercept_exceptions_mask & (1 << PF_VECTOR)))
if (npt_enabled && !vcpu->arch.apf.host_apf_flags) {
/* If the #PF was only intercepted for debug, inject
* it directly to the guest, since the kvm's mmu code
* is not ready to deal with such page faults.
*/
kvm_queue_exception_e_p(vcpu, PF_VECTOR,
error_code, fault_address);
return 1;
}
- return kvm_handle_page_fault(vcpu, error_code, fault_address, static_cpu_has(X86_FEATURE_DECODEASSISTS) ? svm->vmcb->control.insn_bytes : NULL,
@@ -1967,6 +1999,40 @@ static int ac_interception(struct kvm_vcpu *vcpu) return 1; } +static int gen_exc_interception(struct kvm_vcpu *vcpu) +{
- /*
* Generic exception intercept handler which forwards a guest exception
* as-is to the guest.
* For exceptions that don't have a special intercept handler.
*
* Used only for 'force_intercept_exceptions_mask' KVM debug feature.
*/
- struct vcpu_svm *svm = to_svm(vcpu);
- int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE;
- /* SVM doesn't provide us with an error code for the #DF */
- u32 err_code = exc == DF_VECTOR ? 0 : svm->vmcb->control.exit_info_1;
- if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
return svm_handle_invalid_exit(vcpu, svm->vmcb->control.exit_code);
- if (exc == TS_VECTOR) {
/*
* SVM doesn't provide us with an error code to be able to
* re-inject the #TS exception, so just disable its
* intercept, and let the guest re-execute the instruction.
*/
vmcb_clr_intercept(&svm->vmcb01.ptr->control,
INTERCEPT_EXCEPTION_OFFSET + TS_VECTOR);
recalc_intercepts(svm);
- } else if (x86_exception_has_error_code(exc))
kvm_queue_exception_e(vcpu, exc, err_code);
- else
kvm_queue_exception(vcpu, exc);
- return 1;
+}
static bool is_erratum_383(void) { int err, i; @@ -3065,6 +3131,10 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_WRITE_DR5] = dr_interception, [SVM_EXIT_WRITE_DR6] = dr_interception, [SVM_EXIT_WRITE_DR7] = dr_interception,
- [SVM_EXIT_EXCP_BASE ...
- SVM_EXIT_EXCP_BASE + 31] = gen_exc_interception,
- [SVM_EXIT_EXCP_BASE + DB_VECTOR] = db_interception, [SVM_EXIT_EXCP_BASE + BP_VECTOR] = bp_interception, [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 524d943f3efc..187ada7c5b03 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -196,6 +196,7 @@ struct vcpu_svm { bool ghcb_sa_free; bool guest_state_loaded;
- u32 force_intercept_exceptions_mask;
}; struct svm_cpu_data { @@ -351,8 +352,11 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) struct vmcb *vmcb = svm->vmcb01.ptr; WARN_ON_ONCE(bit >= 32);
- vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
- if ((1 << bit) & svm->force_intercept_exceptions_mask)
return;
- vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); recalc_intercepts(svm);
} diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 092e2fad3c0d..e5c7b8fa1f7f 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -695,12 +695,13 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, } EXPORT_SYMBOL_GPL(kvm_queue_exception_p); -static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
u32 error_code, unsigned long payload)
+void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
u32 error_code, unsigned long payload)
{ kvm_multiple_exception(vcpu, nr, true, error_code, true, payload, false); } +EXPORT_SYMBOL_GPL(kvm_queue_exception_e_p); int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err) {
On Wed, Aug 11, 2021, Maxim Levitsky wrote:
On Wed, 2021-08-11 at 15:29 +0300, Maxim Levitsky wrote:
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e45259177009..19f54b07161a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -233,6 +233,8 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; #define MSRS_RANGE_SIZE 2048 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2) +static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code);
u32 svm_msrpm_offset(u32 msr) { u32 offset; @@ -1153,6 +1155,22 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu, } } +static void svm_init_force_exceptions_intercepts(struct vcpu_svm *svm) +{
- int exc;
- svm->force_intercept_exceptions_mask = force_intercept_exceptions_mask;
Ah, the param is being snapshotted on vCPU creation, hence the writable module param. That works, though it'd be better to snapshot it on a per-VM basis, not per-vCPU, and do so in common x86 code so that the param doesn't need to be exported.
- for (exc = 0 ; exc < 32 ; exc++) {
for_each_set_bit()
if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
continue;
/* Those are defined to have undefined behavior in the SVM spec */
if (exc != 2 && exc != 9)
Maybe add a pr_warn_once() to let the user know they done messed up?
And given that there are already intercepts with undefined behavior, it's probably best to disallow intercepting anything we aren't 100% positive will be handled correctly, e.g. intercepting vector 31 is nonsensical at this time.
continue;
set_exception_intercept(svm, exc);
...
+static int gen_exc_interception(struct kvm_vcpu *vcpu) +{
- /*
* Generic exception intercept handler which forwards a guest exception
* as-is to the guest.
* For exceptions that don't have a special intercept handler.
*
* Used only for 'force_intercept_exceptions_mask' KVM debug feature.
*/
- struct vcpu_svm *svm = to_svm(vcpu);
- int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE;
- /* SVM doesn't provide us with an error code for the #DF */
- u32 err_code = exc == DF_VECTOR ? 0 : svm->vmcb->control.exit_info_1;
Might be better to handle this in the x86_exception_has_error_code() path to avoid confusing readers with respect to exceptions that don't have an error code, e.g.
	else if (x86_exception_has_error_code(exc)) {
		/* SVM doesn't provide the error code on #DF :-( */
		if (exc == DF_VECTOR)
			kvm_queue_exception_e(vcpu, exc, 0);
		else
			kvm_queue_exception_e(vcpu, exc, svm->vmcb->control.exit_info_1);
	} else {
		...
	}
Alternatively, can we zero svm->vmcb->control.exit_info_1 on #DF to make it more obvious that SVM leaves stale data in exit_info_1 (assuming that's true)? E.g.
...
	if (exc == TS_VECTOR) {
		...
	} else if (x86_exception_has_error_code(exc)) {
		/* SVM doesn't provide the error code on #DF :-( */
		if (exc == DF_VECTOR)
			svm->vmcb->control.exit_info_1 = 0;

		kvm_queue_exception_e(vcpu, exc, svm->vmcb->control.exit_info_1);
	} else {
		...
	}
- if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
BIT(exc)
return svm_handle_invalid_exit(vcpu, svm->vmcb->control.exit_code);
- if (exc == TS_VECTOR) {
/*
* SVM doesn't provide us with an error code to be able to
* re-inject the #TS exception, so just disable its
* intercept, and let the guest re-execute the instruction.
*/
vmcb_clr_intercept(&svm->vmcb01.ptr->control,
INTERCEPT_EXCEPTION_OFFSET + TS_VECTOR);
Maybe just disallow intercepting #TS altogether? Or does this fall into your Win98 use case? :-)
recalc_intercepts(svm);
- } else if (x86_exception_has_error_code(exc))
kvm_queue_exception_e(vcpu, exc, err_code);
- else
kvm_queue_exception(vcpu, exc);
- return 1;
+}
static bool is_erratum_383(void) { int err, i; @@ -3065,6 +3131,10 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_WRITE_DR5] = dr_interception, [SVM_EXIT_WRITE_DR6] = dr_interception, [SVM_EXIT_WRITE_DR7] = dr_interception,
- [SVM_EXIT_EXCP_BASE ...
- SVM_EXIT_EXCP_BASE + 31] = gen_exc_interception,
This generates a Sparse warning due to the duplicate initializer. IMO that's a very good warning as I have zero idea how the compiler actually handles this particular scenario, e.g. do later entries take priority, is it technically "undefined" behavior, etc...
arch/x86/kvm/svm/svm.c:3065:10: warning: Initializer entry defined twice
arch/x86/kvm/svm/svm.c:3067:29:   also defined here
I don't have a clever solution though :-(
- [SVM_EXIT_EXCP_BASE + DB_VECTOR] = db_interception, [SVM_EXIT_EXCP_BASE + BP_VECTOR] = bp_interception, [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 524d943f3efc..187ada7c5b03 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -196,6 +196,7 @@ struct vcpu_svm { bool ghcb_sa_free; bool guest_state_loaded;
- u32 force_intercept_exceptions_mask;
}; struct svm_cpu_data { @@ -351,8 +352,11 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) struct vmcb *vmcb = svm->vmcb01.ptr; WARN_ON_ONCE(bit >= 32);
- vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
- if ((1 << bit) & svm->force_intercept_exceptions_mask)
BIT(bit)
return;
- vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); recalc_intercepts(svm);
}
On Thu, 2021-09-02 at 17:34 +0000, Sean Christopherson wrote:
On Wed, Aug 11, 2021, Maxim Levitsky wrote:
On Wed, 2021-08-11 at 15:29 +0300, Maxim Levitsky wrote:
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index e45259177009..19f54b07161a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -233,6 +233,8 @@ static const u32 msrpm_ranges[] = {0, 0xc0000000, 0xc0010000}; #define MSRS_RANGE_SIZE 2048 #define MSRS_IN_RANGE (MSRS_RANGE_SIZE * 8 / 2) +static int svm_handle_invalid_exit(struct kvm_vcpu *vcpu, u64 exit_code);
u32 svm_msrpm_offset(u32 msr) { u32 offset; @@ -1153,6 +1155,22 @@ static void svm_recalc_instruction_intercepts(struct kvm_vcpu *vcpu, } } +static void svm_init_force_exceptions_intercepts(struct vcpu_svm *svm) +{
- int exc;
- svm->force_intercept_exceptions_mask = force_intercept_exceptions_mask;
Ah, the param is being snapshotted on vCPU creation, hence the writable module param. That works, though it'd be better to snapshot it on a per-VM basis, not per-vCPU, and do so in common x86 code so that the param doesn't need to be exported.
I have nothing against that.
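A minimal sketch of such a per-VM snapshot (the kvm_arch field and the helper name are assumptions for illustration only, not taken from any posted revision):

/*
 * Hypothetical: snapshot the module parameter once per VM in common x86
 * code (e.g. called from kvm_arch_init_vm() before any vCPU is created),
 * so vendor code reads kvm->arch.force_intercept_exceptions_mask and the
 * variable itself no longer has to be exported from kvm.ko.
 */
static void kvm_snapshot_force_intercept_exceptions(struct kvm *kvm)
{
	kvm->arch.force_intercept_exceptions_mask =
		READ_ONCE(force_intercept_exceptions_mask);
}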
- for (exc = 0 ; exc < 32 ; exc++) {
for_each_set_bit()
I used a helper function instead, IMHO a bit cleaner.
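For comparison, the loop rewritten around for_each_set_bit() as suggested; this is only a sketch (with the inverted vector check from the follow-up fix folded in), not the code that ended up in the next revision:

static void svm_init_force_exceptions_intercepts(struct vcpu_svm *svm)
{
	unsigned long mask = force_intercept_exceptions_mask;
	int vector;

	svm->force_intercept_exceptions_mask = mask;

	/* Walk only the requested vectors. */
	for_each_set_bit(vector, &mask, 32) {
		/* Intercepts of vectors 2 and 9 have undefined behavior per the SVM spec */
		if (vector == 2 || vector == 9)
			continue;

		set_exception_intercept(svm, vector);
	}
}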
if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
continue;
/* Those are defined to have undefined behavior in the SVM spec */
if (exc != 2 && exc != 9)
Maybe add a pr_warn_once() to let the user know they done messed up?
And given that there are already intercepts with undefined behavior, it's probably best to disallow intercepting anything we aren't 100% positive will be handled correctly, e.g. intercepting vector 31 is nonsensical at this time.
Or I think I'll just drop this check altogether - this is a debug feature anyway.
continue;
set_exception_intercept(svm, exc);
...
+static int gen_exc_interception(struct kvm_vcpu *vcpu) +{
- /*
* Generic exception intercept handler which forwards a guest exception
* as-is to the guest.
* For exceptions that don't have a special intercept handler.
*
* Used only for 'force_intercept_exceptions_mask' KVM debug feature.
*/
- struct vcpu_svm *svm = to_svm(vcpu);
- int exc = svm->vmcb->control.exit_code - SVM_EXIT_EXCP_BASE;
- /* SVM doesn't provide us with an error code for the #DF */
- u32 err_code = exc == DF_VECTOR ? 0 : svm->vmcb->control.exit_info_1;
Might be better to handle this in the x86_exception_has_error_code() path to avoid confusing readers with respect to exceptions that don't have an error code, e.g.
	else if (x86_exception_has_error_code(exc)) {
		/* SVM doesn't provide the error code on #DF :-( */
		if (exc == DF_VECTOR)
			kvm_queue_exception_e(vcpu, exc, 0);
		else
			kvm_queue_exception_e(vcpu, exc, svm->vmcb->control.exit_info_1);
	} else {
		...
	}
Alternatively, can we zero svm->vmcb->control.exit_info_1 on #DF to make it more obvious that SVM leaves stale data in exit_info_1 (assuming that's true)? E.g.
...
	if (exc == TS_VECTOR) {
		...
	} else if (x86_exception_has_error_code(exc)) {
		/* SVM doesn't provide the error code on #DF :-( */
		if (exc == DF_VECTOR)
			svm->vmcb->control.exit_info_1 = 0;

		kvm_queue_exception_e(vcpu, exc, svm->vmcb->control.exit_info_1);
	} else {
		...
	}
Makes sense.
- if (!(svm->force_intercept_exceptions_mask & (1 << exc)))
BIT(exc)
I added a helper function in common x86 code for this.
return svm_handle_invalid_exit(vcpu, svm->vmcb->control.exit_code);
- if (exc == TS_VECTOR) {
/*
* SVM doesn't provide us with an error code to be able to
* re-inject the #TS exception, so just disable its
* intercept, and let the guest re-execute the instruction.
*/
vmcb_clr_intercept(&svm->vmcb01.ptr->control,
INTERCEPT_EXCEPTION_OFFSET + TS_VECTOR);
Maybe just disallow intercepting #TS altogether? Or does this fall into your Win98 use case? :-)
Win98 does indeed generate a few #TS exceptions, but so far I haven't noticed any issues related to task switches. Anyway, I would like to intercept as much as possible since this is a debug feature. A single interception is still better than nothing.
recalc_intercepts(svm);
- } else if (x86_exception_has_error_code(exc))
kvm_queue_exception_e(vcpu, exc, err_code);
- else
kvm_queue_exception(vcpu, exc);
- return 1;
+}
static bool is_erratum_383(void) { int err, i; @@ -3065,6 +3131,10 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = { [SVM_EXIT_WRITE_DR5] = dr_interception, [SVM_EXIT_WRITE_DR6] = dr_interception, [SVM_EXIT_WRITE_DR7] = dr_interception,
- [SVM_EXIT_EXCP_BASE ...
- SVM_EXIT_EXCP_BASE + 31] = gen_exc_interception,
This generates a Sparse warning due to the duplicate initializer. IMO that's a very good warning as I have zero idea how the compiler actually handles this particular scenario, e.g. do later entries take priority, is it technically "undefined" behavior, etc...
arch/x86/kvm/svm/svm.c:3065:10: warning: Initializer entry defined twice
arch/x86/kvm/svm/svm.c:3067:29:   also defined here
I don't have a clever solution though :-(
Good catch. I thought that this would make sense but standards never make sense. I'll do this manually.
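One 'manual' way to keep the table const and avoid the overlapping range initializer, sketched purely as an illustration (the actual next revision may differ): spell out gen_exc_interception only for the vectors that have no dedicated handler, e.g.

	/* excerpt of svm_exit_handlers[], illustrative only */
	[SVM_EXIT_EXCP_BASE + DE_VECTOR]	= gen_exc_interception,
	[SVM_EXIT_EXCP_BASE + OF_VECTOR]	= gen_exc_interception,
	[SVM_EXIT_EXCP_BASE + BR_VECTOR]	= gen_exc_interception,
	[SVM_EXIT_EXCP_BASE + NM_VECTOR]	= gen_exc_interception,
	[SVM_EXIT_EXCP_BASE + DF_VECTOR]	= gen_exc_interception,
	[SVM_EXIT_EXCP_BASE + TS_VECTOR]	= gen_exc_interception,
	[SVM_EXIT_EXCP_BASE + NP_VECTOR]	= gen_exc_interception,
	[SVM_EXIT_EXCP_BASE + SS_VECTOR]	= gen_exc_interception,
	/* ...and so on for the other vectors without a dedicated handler... */
	[SVM_EXIT_EXCP_BASE + DB_VECTOR]	= db_interception,
	[SVM_EXIT_EXCP_BASE + BP_VECTOR]	= bp_interception,
	[SVM_EXIT_EXCP_BASE + UD_VECTOR]	= ud_interception,

This keeps every entry defined exactly once, so the Sparse warning goes away without relying on how overlapping designated initializers are resolved.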
- [SVM_EXIT_EXCP_BASE + DB_VECTOR] = db_interception, [SVM_EXIT_EXCP_BASE + BP_VECTOR] = bp_interception, [SVM_EXIT_EXCP_BASE + UD_VECTOR] = ud_interception,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 524d943f3efc..187ada7c5b03 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -196,6 +196,7 @@ struct vcpu_svm { bool ghcb_sa_free; bool guest_state_loaded;
- u32 force_intercept_exceptions_mask;
}; struct svm_cpu_data { @@ -351,8 +352,11 @@ static inline void clr_exception_intercept(struct vcpu_svm *svm, u32 bit) struct vmcb *vmcb = svm->vmcb01.ptr; WARN_ON_ONCE(bit >= 32);
- vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit);
- if ((1 << bit) & svm->force_intercept_exceptions_mask)
BIT(bit)
Fixed with helper function as well.
return;
- vmcb_clr_intercept(&vmcb->control, INTERCEPT_EXCEPTION_OFFSET + bit); recalc_intercepts(svm);
}
Thanks for the review! Best regards, Maxim Levitsky
Fix several issues that are present in lx-symbols script:
* Track module unloads by placing another software breakpoint at 'free_module' (force this symbol to be uninlined just in case), and use the remove-symbol-file gdb command to unload the symbols of the module that is being unloaded.
That gives gdb a chance to mark all software breakpoints from this module as pending again. Also remove the module from the 'known' module list once it is unloaded.
* Since we now track module unloads, we no longer need to reload all symbols when a 'known' module is loaded again (that can't happen anymore). This allows reloading a module in the debugged kernel to finish much faster, while lx-symbols keeps tracking module loads and unloads.
* Disable/enable all gdb breakpoints on both module load and unload breakpoint hits, not only in 'load_all_symbols' as was done before (load_all_symbols is no longer called on a breakpoint hit). That keeps gdb from getting confused about the state of the (now two) internal breakpoints we place. Otherwise it would leave them in the kernel code segment when continuing, which triggers a guest kernel panic as soon as the guest skips over the 'int3' instruction and executes the garbage tail of the opcode on which the breakpoint was placed.
* Block SIGINT while the script is running, as receiving it seems to crash gdb.
* Add a basic check that the kernel is already loaded into guest memory, to avoid confusing errors.
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 kernel/module.c              |   8 +-
 scripts/gdb/linux/symbols.py | 203 +++++++++++++++++++++++------------
 2 files changed, 143 insertions(+), 68 deletions(-)
diff --git a/kernel/module.c b/kernel/module.c index ed13917ea5f3..242bd4bb0b55 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -906,8 +906,12 @@ int module_refcount(struct module *mod) } EXPORT_SYMBOL(module_refcount);
-/* This exists whether we can unload or not */ -static void free_module(struct module *mod); +/* This exists whether we can unload or not + * Keep it uninlined to provide a reliable breakpoint target, + * e.g. for the gdb helper command 'lx-symbols'. + */ + +static noinline void free_module(struct module *mod);
SYSCALL_DEFINE2(delete_module, const char __user *, name_user, unsigned int, flags) diff --git a/scripts/gdb/linux/symbols.py b/scripts/gdb/linux/symbols.py index 08d264ac328b..78e278fb4bad 100644 --- a/scripts/gdb/linux/symbols.py +++ b/scripts/gdb/linux/symbols.py @@ -14,45 +14,23 @@ import gdb import os import re +import signal
from linux import modules, utils
if hasattr(gdb, 'Breakpoint'): - class LoadModuleBreakpoint(gdb.Breakpoint): - def __init__(self, spec, gdb_command): - super(LoadModuleBreakpoint, self).__init__(spec, internal=True) + + class BreakpointWrapper(gdb.Breakpoint): + def __init__(self, callback, **kwargs): + super(BreakpointWrapper, self).__init__(internal=True, **kwargs) self.silent = True - self.gdb_command = gdb_command + self.callback = callback
def stop(self): - module = gdb.parse_and_eval("mod") - module_name = module['name'].string() - cmd = self.gdb_command - - # enforce update if object file is not found - cmd.module_files_updated = False - - # Disable pagination while reporting symbol (re-)loading. - # The console input is blocked in this context so that we would - # get stuck waiting for the user to acknowledge paged output. - show_pagination = gdb.execute("show pagination", to_string=True) - pagination = show_pagination.endswith("on.\n") - gdb.execute("set pagination off") - - if module_name in cmd.loaded_modules: - gdb.write("refreshing all symbols to reload module " - "'{0}'\n".format(module_name)) - cmd.load_all_symbols() - else: - cmd.load_module_symbols(module) - - # restore pagination state - gdb.execute("set pagination %s" % ("on" if pagination else "off")) - + self.callback() return False
- class LxSymbols(gdb.Command): """(Re-)load symbols of Linux kernel and currently loaded modules.
@@ -61,15 +39,52 @@ are scanned recursively, starting in the same directory. Optionally, the module search path can be extended by a space separated list of paths passed to the lx-symbols command."""
- module_paths = [] - module_files = [] - module_files_updated = False - loaded_modules = [] - breakpoint = None - def __init__(self): super(LxSymbols, self).__init__("lx-symbols", gdb.COMMAND_FILES, gdb.COMPLETE_FILENAME) + self.module_paths = [] + self.module_files = [] + self.module_files_updated = False + self.loaded_modules = {} + self.internal_breakpoints = [] + + # prepare GDB for loading/unloading a module + def _prepare_for_module_load_unload(self): + + self.blocked_sigint = False + + # block SIGINT during execution to avoid gdb crash + sigmask = signal.pthread_sigmask(signal.SIG_BLOCK, []) + if not signal.SIGINT in sigmask: + self.blocked_sigint = True + signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT}) + + # disable all breakpoints to workaround a GDB bug where it would + # not correctly resume from an internal breakpoint we placed + # in do_module_init/free_module (it leaves the int3 + self.saved_breakpoints = [] + if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None: + for bp in gdb.breakpoints(): + self.saved_breakpoints.append({'breakpoint': bp, 'enabled': bp.enabled}) + bp.enabled = False + + # disable pagination to avoid asking user for continue + show_pagination = gdb.execute("show pagination", to_string=True) + self.saved_pagination = show_pagination.endswith("on.\n") + gdb.execute("set pagination off") + + def _unprepare_for_module_load_unload(self): + # restore breakpoint state + for breakpoint in self.saved_breakpoints: + breakpoint['breakpoint'].enabled = breakpoint['enabled'] + + # restore pagination state + gdb.execute("set pagination %s" % ("on" if self.saved_pagination else "off")) + + # unblock SIGINT + if self.blocked_sigint: + sigmask = signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT}) + self.blocked_sigint = False
def _update_module_files(self): self.module_files = [] @@ -107,7 +122,7 @@ lx-symbols command.""" name=section_name, addr=str(address))) return "".join(args)
- def load_module_symbols(self, module): + def _do_load_module_symbols(self, module): module_name = module['name'].string() module_addr = str(module['core_layout']['base']).split()[0]
@@ -130,56 +145,112 @@ lx-symbols command.""" addr=module_addr, sections=self._section_arguments(module)) gdb.execute(cmdline, to_string=True) - if module_name not in self.loaded_modules: - self.loaded_modules.append(module_name) + + self.loaded_modules[module_name] = {"module_file": module_file, + "module_addr": module_addr} else: gdb.write("no module object found for '{0}'\n".format(module_name))
+ + def load_module_symbols(self): + module = gdb.parse_and_eval("mod") + + # module already loaded, false alarm + # can happen if 'do_init_module' breakpoint is hit multiple times + # due to interrupts + module_name = module['name'].string() + if module_name in self.loaded_modules: + gdb.write("spurious module load breakpoint\n") + return + + # enforce update if object file is not found + self.module_files_updated = False + self._prepare_for_module_load_unload() + try: + self._do_load_module_symbols(module) + finally: + self._unprepare_for_module_load_unload() + + + def unload_module_symbols(self): + module = gdb.parse_and_eval("mod") + module_name = module['name'].string() + + # module already unloaded, false alarm + # can happen if 'free_module' breakpoint is hit multiple times + # due to interrupts + if not module_name in self.loaded_modules: + gdb.write("spurious module unload breakpoint\n") + return + + module_file = self.loaded_modules[module_name]["module_file"] + module_addr = self.loaded_modules[module_name]["module_addr"] + + self._prepare_for_module_load_unload() + try: + gdb.write("unloading @{addr}: {filename}\n".format( + addr=module_addr, filename=module_file)) + cmdline = "remove-symbol-file {filename}".format( + filename=module_file) + gdb.execute(cmdline, to_string=True) + del self.loaded_modules[module_name] + + finally: + self._unprepare_for_module_load_unload() + def load_all_symbols(self): gdb.write("loading vmlinux\n")
- # Dropping symbols will disable all breakpoints. So save their states - # and restore them afterward. - saved_states = [] - if hasattr(gdb, 'breakpoints') and not gdb.breakpoints() is None: - for bp in gdb.breakpoints(): - saved_states.append({'breakpoint': bp, 'enabled': bp.enabled}) - - # drop all current symbols and reload vmlinux - orig_vmlinux = 'vmlinux' - for obj in gdb.objfiles(): - if obj.filename.endswith('vmlinux'): - orig_vmlinux = obj.filename - gdb.execute("symbol-file", to_string=True) - gdb.execute("symbol-file {0}".format(orig_vmlinux)) - - self.loaded_modules = [] - module_list = modules.module_list() - if not module_list: - gdb.write("no modules found\n") - else: - [self.load_module_symbols(module) for module in module_list] + self._prepare_for_module_load_unload() + try: + # drop all current symbols and reload vmlinux + orig_vmlinux = 'vmlinux' + for obj in gdb.objfiles(): + if obj.filename.endswith('vmlinux'): + orig_vmlinux = obj.filename + gdb.execute("symbol-file", to_string=True) + gdb.execute("symbol-file {0}".format(orig_vmlinux)) + self.loaded_modules = {} + module_list = modules.module_list() + if not module_list: + gdb.write("no modules found\n") + else: + [self._do_load_module_symbols(module) for module in module_list] + finally: + self._unprepare_for_module_load_unload()
- for saved_state in saved_states: - saved_state['breakpoint'].enabled = saved_state['enabled'] + self._unprepare_for_module_load_unload()
def invoke(self, arg, from_tty): self.module_paths = [os.path.abspath(os.path.expanduser(p)) for p in arg.split()] self.module_paths.append(os.getcwd())
+ try: + gdb.parse_and_eval("*start_kernel").fetch_lazy() + except gdb.MemoryError: + gdb.write("Error: Kernel is not yet loaded\n") + return + # enforce update self.module_files = [] self.module_files_updated = False
+ for bp in self.internal_breakpoints: + bp.delete() + self.internal_breakpoints = [] + self.load_all_symbols()
if hasattr(gdb, 'Breakpoint'): - if self.breakpoint is not None: - self.breakpoint.delete() - self.breakpoint = None - self.breakpoint = LoadModuleBreakpoint( - "kernel/module.c:do_init_module", self) + self.internal_breakpoints.append( + BreakpointWrapper(self.load_module_symbols, + spec="kernel/module.c:do_init_module", + )) + self.internal_breakpoints.append( + BreakpointWrapper(self.unload_module_symbols, + spec="kernel/module.c:free_module", + )) else: gdb.write("Note: symbol update on module loading not supported " "with this gdb version\n")
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts while running.
This change is mostly intended for more robust single stepping of the guest and it has the following benefits when enabled:
* Resuming from a breakpoint is much more reliable. When resuming execution from a breakpoint with interrupts enabled, more often than not KVM would inject an interrupt and make the CPU jump immediately to the interrupt handler, eventually returning to the breakpoint only to trigger it again.
From the user's point of view it looks like the CPU never executed a single instruction, and in some cases that can even prevent forward progress, for example when the breakpoint is placed by an automated script (e.g. lx-symbols) which does something in response to the breakpoint and then continues the guest automatically. If the script execution takes enough time for another interrupt to arrive, the guest will be stuck on the same breakpoint RIP forever.
* Normal single stepping is much more predictable, since it won't land the debugger into an interrupt handler.
* RFLAGS.TF has less chance to be leaked to the guest:
We set that flag behind the guest's back to do single stepping, but if a single step lands us in an interrupt/exception handler, it will be leaked to the guest in the form of being pushed to the stack. This doesn't completely eliminate the problem, as exceptions can still happen, but it at least reduces the chances of that happening.
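To make the intended use concrete, this is roughly how a userspace debugger would single-step with interrupts blocked once the flag is available; a minimal sketch with error handling and vCPU setup omitted:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Single-step one instruction without letting KVM inject interrupts. */
static int single_step_blockirq(int vcpu_fd)
{
	struct kvm_guest_debug dbg;

	memset(&dbg, 0, sizeof(dbg));
	dbg.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP |
		      KVM_GUESTDBG_BLOCKIRQ;

	if (ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg) < 0)
		return -1;

	return ioctl(vcpu_fd, KVM_RUN, 0);
}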
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 Documentation/virt/kvm/api.rst  | 1 +
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/include/uapi/asm/kvm.h | 1 +
 arch/x86/kvm/x86.c              | 4 ++++
 4 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 86d7ad3a126c..4ea1bb28297b 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -3357,6 +3357,7 @@ flags which can include the following:
   - KVM_GUESTDBG_INJECT_DB: inject DB type exception [x86]
   - KVM_GUESTDBG_INJECT_BP: inject BP type exception [x86]
   - KVM_GUESTDBG_EXIT_PENDING: trigger an immediate guest exit [s390]
+  - KVM_GUESTDBG_BLOCKIRQ: avoid injecting interrupts/NMI/SMI [x86]
 
 For example KVM_GUESTDBG_USE_SW_BP indicates that software breakpoints
 are enabled in memory so we need to ensure breakpoint exceptions are
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 72fe03506018..a12db20069d5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -222,7 +222,8 @@ enum x86_intercept_stage;
 				KVM_GUESTDBG_USE_HW_BP | \
 				KVM_GUESTDBG_USE_SW_BP | \
 				KVM_GUESTDBG_INJECT_BP | \
-				KVM_GUESTDBG_INJECT_DB)
+				KVM_GUESTDBG_INJECT_DB | \
+				KVM_GUESTDBG_BLOCKIRQ)
 
 
 #define PFERR_PRESENT_BIT 0
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a6c327f8ad9e..2ef1f6513c68 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -295,6 +295,7 @@ struct kvm_debug_exit_arch {
 #define KVM_GUESTDBG_USE_HW_BP 0x00020000
 #define KVM_GUESTDBG_INJECT_DB 0x00040000
 #define KVM_GUESTDBG_INJECT_BP 0x00080000
+#define KVM_GUESTDBG_BLOCKIRQ 0x00100000
 
 /* for KVM_SET_GUEST_DEBUG */
 struct kvm_guest_debug_arch {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e5c7b8fa1f7f..d7fade9e3b94 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8883,6 +8883,10 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 		can_inject = false;
 	}
 
+	/* Don't inject interrupts if the user asked to avoid doing so */
+	if (vcpu->guest_debug & KVM_GUESTDBG_BLOCKIRQ)
+		return 0;
+
 	/*
 	 * Finally, inject interrupt events. If an event cannot be injected
 	 * due to architectural conditions (e.g. IF=0) a window-open exit
Modify debug_regs test to create a pending interrupt and see that it is blocked when single stepping is done with KVM_GUESTDBG_BLOCKIRQ
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
---
 .../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/kvm/x86_64/debug_regs.c b/tools/testing/selftests/kvm/x86_64/debug_regs.c index 6097a8283377..5f078db1bcba 100644 --- a/tools/testing/selftests/kvm/x86_64/debug_regs.c +++ b/tools/testing/selftests/kvm/x86_64/debug_regs.c @@ -8,12 +8,15 @@ #include <string.h> #include "kvm_util.h" #include "processor.h" +#include "apic.h"
#define VCPU_ID 0
#define DR6_BD (1 << 13) #define DR7_GD (1 << 13)
+#define IRQ_VECTOR 0xAA + /* For testing data access debug BP */ uint32_t guest_value;
@@ -21,6 +24,11 @@ extern unsigned char sw_bp, hw_bp, write_data, ss_start, bd_start;
static void guest_code(void) { + /* Create a pending interrupt on current vCPU */ + x2apic_enable(); + x2apic_write_reg(APIC_ICR, APIC_DEST_SELF | APIC_INT_ASSERT | + APIC_DM_FIXED | IRQ_VECTOR); + /* * Software BP tests. * @@ -38,12 +46,19 @@ static void guest_code(void) "mov %%rax,%0;\n\t write_data:" : "=m" (guest_value) : : "rax");
- /* Single step test, covers 2 basic instructions and 2 emulated */ + /* + * Single step test, covers 2 basic instructions and 2 emulated + * + * Enable interrupts during the single stepping to see that + * pending interrupt we raised is not handled due to KVM_GUESTDBG_BLOCKIRQ + */ asm volatile("ss_start: " + "sti\n\t" "xor %%eax,%%eax\n\t" "cpuid\n\t" "movl $0x1a0,%%ecx\n\t" "rdmsr\n\t" + "cli\n\t" : : : "eax", "ebx", "ecx", "edx");
/* DR6.BD test */ @@ -72,11 +87,13 @@ int main(void) uint64_t cmd; int i; /* Instruction lengths starting at ss_start */ - int ss_size[4] = { + int ss_size[6] = { + 1, /* sti*/ 2, /* xor */ 2, /* cpuid */ 5, /* mov */ 2, /* rdmsr */ + 1, /* cli */ };
if (!kvm_check_cap(KVM_CAP_SET_GUEST_DEBUG)) { @@ -154,7 +171,8 @@ int main(void) for (i = 0; i < (sizeof(ss_size) / sizeof(ss_size[0])); i++) { target_rip += ss_size[i]; CLEAR_DEBUG(); - debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP; + debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP | + KVM_GUESTDBG_BLOCKIRQ; debug.arch.debugreg[7] = 0x00000400; APPLY_DEBUG(); vcpu_run(vm, VCPU_ID);
On 11/08/21 14:29, Maxim Levitsky wrote:
Modify debug_regs test to create a pending interrupt and see that it is blocked when single stepping is done with KVM_GUESTDBG_BLOCKIRQ
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
 .../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++++++++++++++++---
 1 file changed, 21 insertions(+), 3 deletions(-)
I haven't looked very much at this, but the test fails.
Paolo
On Mon, 2021-09-06 at 13:20 +0200, Paolo Bonzini wrote:
On 11/08/21 14:29, Maxim Levitsky wrote:
Modify debug_regs test to create a pending interrupt and see that it is blocked when single stepping is done with KVM_GUESTDBG_BLOCKIRQ
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
.../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
I haven't looked very much at this, but the test fails.
Works for me :-(
[mlevitsk@starship ~/Kernel/master/src/tools/testing/selftests/kvm]$./x86_64/debug_regs
[mlevitsk@starship ~/Kernel/master/src/tools/testing/selftests/kvm]$echo $?
0
Maybe you ran the test on a kernel that doesn't support KVM_GUESTDBG_BLOCKIRQ?
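One quick way to rule that out is to query KVM_CAP_SET_GUEST_DEBUG2, which returns the set of KVM_GUESTDBG_* bits the running kernel supports. A minimal probe along these lines (a sketch; it assumes uapi headers new enough to define the capability, and the fallback value for KVM_GUESTDBG_BLOCKIRQ is an assumption to be checked against this series' uapi change):

	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	#ifndef KVM_GUESTDBG_BLOCKIRQ
	#define KVM_GUESTDBG_BLOCKIRQ 0x00100000 /* assumed value, verify against the series */
	#endif

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDONLY);
		int bits;

		if (kvm < 0) {
			perror("/dev/kvm");
			return 1;
		}

		/* Returns a bitmask of supported KVM_GUESTDBG_* flags;
		 * 0 (or an error) means the capability itself is missing. */
		bits = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG2);

		printf("KVM_GUESTDBG_BLOCKIRQ %ssupported\n",
		       (bits > 0 && (bits & KVM_GUESTDBG_BLOCKIRQ)) ? "" : "not ");
		return 0;
	}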
Best regards, Maxim Levitsky
Paolo
Paolo Bonzini pbonzini@redhat.com writes:
On 11/08/21 14:29, Maxim Levitsky wrote:
Modify debug_regs test to create a pending interrupt and see that it is blocked when single stepping is done with KVM_GUESTDBG_BLOCKIRQ
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
.../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
I haven't looked very much at this, but the test fails.
Same here,
the test passes on AMD but fails consistently on Intel:
# ./x86_64/debug_regs
==== Test Assertion Failure ====
  x86_64/debug_regs.c:179: run->exit_reason == KVM_EXIT_DEBUG &&
            run->debug.arch.exception == DB_VECTOR &&
            run->debug.arch.pc == target_rip &&
            run->debug.arch.dr6 == target_dr6
  pid=13434 tid=13434 errno=0 - Success
     1	0x00000000004027c6: main at debug_regs.c:179
     2	0x00007f65344cf554: ?? ??:0
     3	0x000000000040294a: _start at ??:?
  SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0)
(I know I'm late to the party).
Paolo
On Mon, 2021-11-01 at 16:43 +0100, Vitaly Kuznetsov wrote:
Paolo Bonzini pbonzini@redhat.com writes:
On 11/08/21 14:29, Maxim Levitsky wrote:
Modify debug_regs test to create a pending interrupt and see that it is blocked when single stepping is done with KVM_GUESTDBG_BLOCKIRQ
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
.../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
I haven't looked very much at this, but the test fails.
Same here,
the test passes on AMD but fails consistently on Intel:
# ./x86_64/debug_regs ==== Test Assertion Failure ==== x86_64/debug_regs.c:179: run->exit_reason == KVM_EXIT_DEBUG && run->debug.arch.exception == DB_VECTOR && run->debug.arch.pc == target_rip && run->debug.arch.dr6 == target_dr6 pid=13434 tid=13434 errno=0 - Success 1 0x00000000004027c6: main at debug_regs.c:179 2 0x00007f65344cf554: ?? ??:0 3 0x000000000040294a: _start at ??:? SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0)
(I know I'm late to the party).
Well, that is strange. It passes on my Intel laptop; just tested (kvm/queue + qemu master, compiled today) :-(
It fails on iteration 1 (and there is an iteration 0), which I think means that we start with RIP on the sti, correctly get a #DB at the start of the xor instruction, and then get a #DB at the start of the xor instruction again, i.e. RIP did not advance?
Something very strange. My laptop has an i7-7600U.
Best regards, Maxim Levitsky
Paolo
On Mon, Nov 01, 2021, Maxim Levitsky wrote:
On Mon, 2021-11-01 at 16:43 +0100, Vitaly Kuznetsov wrote:
Paolo Bonzini pbonzini@redhat.com writes:
On 11/08/21 14:29, Maxim Levitsky wrote:
Modify debug_regs test to create a pending interrupt and see that it is blocked when single stepping is done with KVM_GUESTDBG_BLOCKIRQ
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
.../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
I haven't looked very much at this, but the test fails.
Same here,
the test passes on AMD but fails consistently on Intel:
# ./x86_64/debug_regs ==== Test Assertion Failure ==== x86_64/debug_regs.c:179: run->exit_reason == KVM_EXIT_DEBUG && run->debug.arch.exception == DB_VECTOR && run->debug.arch.pc == target_rip && run->debug.arch.dr6 == target_dr6 pid=13434 tid=13434 errno=0 - Success 1 0x00000000004027c6: main at debug_regs.c:179 2 0x00007f65344cf554: ?? ??:0 3 0x000000000040294a: _start at ??:? SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0)
(I know I'm late to the party).
Well that is strange. It passes on my intel laptop. Just tested (kvm/queue + qemu master, compiled today) :-(
It fails on iteration 1 (and there is iteration 0) which I think means that we start with RIP on sti, and get #DB on start of xor instruction first (correctly), and then we get #DB again on start of xor instruction again?
Something very strange. My laptop has i7-7600U.
I haven't verified on hardware, but my guess is that this code in vmx_vcpu_run()
	/* When single-stepping over STI and MOV SS, we must clear the
	 * corresponding interruptibility bits in the guest state. Otherwise
	 * vmentry fails as it then expects bit 14 (BS) in pending debug
	 * exceptions being set, but that's not correct for the guest debugging
	 * case. */
	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
		vmx_set_interrupt_shadow(vcpu, 0);
interacts badly with APICv=1. It will kill the STI shadow and cause the IRQ in vmcs.GUEST_RVI to be recognized when it (micro-)architecturally should not. My head is going in circles trying to sort out what would actually happen. Maybe comment out that and/or disable APICv to see if either one makes the test pass?
Sean Christopherson seanjc@google.com writes:
On Mon, Nov 01, 2021, Maxim Levitsky wrote:
On Mon, 2021-11-01 at 16:43 +0100, Vitaly Kuznetsov wrote:
Paolo Bonzini pbonzini@redhat.com writes:
On 11/08/21 14:29, Maxim Levitsky wrote:
Modify debug_regs test to create a pending interrupt and see that it is blocked when single stepping is done with KVM_GUESTDBG_BLOCKIRQ
Signed-off-by: Maxim Levitsky mlevitsk@redhat.com
.../testing/selftests/kvm/x86_64/debug_regs.c | 24 ++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-)
I haven't looked very much at this, but the test fails.
Same here,
the test passes on AMD but fails consistently on Intel:
# ./x86_64/debug_regs ==== Test Assertion Failure ==== x86_64/debug_regs.c:179: run->exit_reason == KVM_EXIT_DEBUG && run->debug.arch.exception == DB_VECTOR && run->debug.arch.pc == target_rip && run->debug.arch.dr6 == target_dr6 pid=13434 tid=13434 errno=0 - Success 1 0x00000000004027c6: main at debug_regs.c:179 2 0x00007f65344cf554: ?? ??:0 3 0x000000000040294a: _start at ??:? SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0)
(I know I'm late to the party).
Well that is strange. It passes on my intel laptop. Just tested (kvm/queue + qemu master, compiled today) :-(
It fails on iteration 1 (and there is iteration 0) which I think means that we start with RIP on sti, and get #DB on start of xor instruction first (correctly), and then we get #DB again on start of xor instruction again?
Something very strange. My laptop has i7-7600U.
I haven't verified on hardware, but my guess is that this code in vmx_vcpu_run()
/* When single-stepping over STI and MOV SS, we must clear the * corresponding interruptibility bits in the guest state. Otherwise * vmentry fails as it then expects bit 14 (BS) in pending debug * exceptions being set, but that's not correct for the guest debugging * case. */ if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) vmx_set_interrupt_shadow(vcpu, 0);
interacts badly with APICv=1. It will kill the STI shadow and cause the IRQ in vmcs.GUEST_RVI to be recognized when it (micro-)architecturally should not. My head is going in circles trying to sort out what would actually happen. Maybe comment out that and/or disable APICv to see if either one makes the test pass?
Interestingly, loading 'kvm-intel' with 'enable_apicv=0' makes the test pass; however, commenting out "vmx_set_interrupt_shadow()" as suggested gives a different result (with enable_apicv=1):
# ./x86_64/debug_regs
==== Test Assertion Failure ====
  x86_64/debug_regs.c:179: run->exit_reason == KVM_EXIT_DEBUG &&
            run->debug.arch.exception == DB_VECTOR &&
            run->debug.arch.pc == target_rip &&
            run->debug.arch.dr6 == target_dr6
  pid=16352 tid=16352 errno=0 - Success
     1	0x0000000000402b33: main at debug_regs.c:179 (discriminator 10)
     2	0x00007f36401bd554: ?? ??:0
     3	0x00000000004023a9: _start at ??:?
  SINGLE_STEP[1]: exit 9 exception -2147483615 rip 0x1 (should be 0x4024d9) dr6 0xffff4ff0 (should be 0xffff4ff0)
This is a fairly old "Intel(R) Xeon(R) CPU E5-2603 v3".
On Tue, Nov 02, 2021, Vitaly Kuznetsov wrote:
Sean Christopherson seanjc@google.com writes:
I haven't verified on hardware, but my guess is that this code in vmx_vcpu_run()
/* When single-stepping over STI and MOV SS, we must clear the * corresponding interruptibility bits in the guest state. Otherwise * vmentry fails as it then expects bit 14 (BS) in pending debug * exceptions being set, but that's not correct for the guest debugging * case. */ if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) vmx_set_interrupt_shadow(vcpu, 0);
interacts badly with APICv=1. It will kill the STI shadow and cause the IRQ in vmcs.GUEST_RVI to be recognized when it (micro-)architecturally should not. My head is going in circles trying to sort out what would actually happen. Maybe comment out that and/or disable APICv to see if either one makes the test pass?
Interestingly,
loading 'kvm-intel' with 'enable_apicv=0' makes the test pass, however, commenting out "vmx_set_interrupt_shadow()" as suggested gives a different result (with enable_apicv=1):
# ./x86_64/debug_regs ==== Test Assertion Failure ==== x86_64/debug_regs.c:179: run->exit_reason == KVM_EXIT_DEBUG && run->debug.arch.exception == DB_VECTOR && run->debug.arch.pc == target_rip && run->debug.arch.dr6 == target_dr6 pid=16352 tid=16352 errno=0 - Success 1 0x0000000000402b33: main at debug_regs.c:179 (discriminator 10) 2 0x00007f36401bd554: ?? ??:0 3 0x00000000004023a9: _start at ??:? SINGLE_STEP[1]: exit 9 exception -2147483615 rip 0x1 (should be 0x4024d9) dr6 0xffff4ff0 (should be 0xffff4ff0)
Exit 9 is KVM_EXIT_FAIL_ENTRY; in this case VM-Entry likely failed due to invalid guest state, because there was STI blocking with single-step enabled but no pending BS #DB:
Bit 14 (BS) must be 1 if the TF flag (bit 8) in the RFLAGS field is 1 and the BTF flag (bit 1) in the IA32_DEBUGCTL field is 0.
Which is precisely what that hack-a-fix avoids. There isn't really a clean solution for legacy single-step; AFAIK the only way to avoid this would be to switch KVM_GUESTDBG_SINGLESTEP to use MTF.
But that mess is a red herring: the test fails with the same signature with APICv=1 if the STI is replaced by PUSHF+BTS+POPFD (to avoid the STI shadow). We all missed this key detail from Vitaly's report:
SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0)
                ^^^^^^
Exit '8' is KVM_EXIT_SHUTDOWN, i.e. the arrival of the IRQ hosed the guest because the test doesn't invoke vm_init_descriptor_tables() to install event handlers. The "exception 1" shows up because the run page isn't sanitized by the test, i.e. it's stale data that happens to match.
So I would fully expect this test to fail with AVIC=1. The problem is that KVM_GUESTDBG_BLOCKIRQ does absolutely nothing to handle APICv interrupts. And even if KVM does something to fudge that behavior in the emulated local APIC, the test will then fail miserably with virtual IPIs (currently AVIC only).
I stand by my original comment that "Deviating this far from architectural behavior will end in tears at some point." Rather than try to "fix" APICv, I vote to instead either reject KVM_GUESTDBG_BLOCKIRQ if APICv=1, or log a debug message saying that KVM_GUESTDBG_BLOCKIRQ is ineffective with APICv=1.
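As an aside on the missing handlers: if the test is ever extended so that the interrupt is actually allowed to arrive (for example to cover the APICv case), the selftest framework already has the plumbing mentioned above. A rough sketch, assuming the usual x86_64 selftest helpers and that apic.h exposes APIC_EOI; the handler name and its body are illustrative, not a proposed fix:

	#include "kvm_util.h"
	#include "processor.h"
	#include "apic.h"

	#define VCPU_ID		0
	#define IRQ_VECTOR	0xAA

	/* Guest side: acknowledge the interrupt so its delivery does not
	 * shut down the guest for lack of an IDT entry. */
	static void guest_irq_handler(struct ex_regs *regs)
	{
		x2apic_write_reg(APIC_EOI, 0);
	}

	/* Host side: call once after creating the VM/vCPU. */
	static void install_irq_handler(struct kvm_vm *vm)
	{
		vm_init_descriptor_tables(vm);
		vcpu_init_descriptor_tables(vm, VCPU_ID);
		vm_install_exception_handler(vm, IRQ_VECTOR, guest_irq_handler);
	}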
Sean Christopherson seanjc@google.com writes:
On Tue, Nov 02, 2021, Vitaly Kuznetsov wrote:
Sean Christopherson seanjc@google.com writes:
I haven't verified on hardware, but my guess is that this code in vmx_vcpu_run()
/* When single-stepping over STI and MOV SS, we must clear the * corresponding interruptibility bits in the guest state. Otherwise * vmentry fails as it then expects bit 14 (BS) in pending debug * exceptions being set, but that's not correct for the guest debugging * case. */ if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) vmx_set_interrupt_shadow(vcpu, 0);
interacts badly with APICv=1. It will kill the STI shadow and cause the IRQ in vmcs.GUEST_RVI to be recognized when it (micro-)architecturally should not. My head is going in circles trying to sort out what would actually happen. Maybe comment out that and/or disable APICv to see if either one makes the test pass?
Interestingly,
loading 'kvm-intel' with 'enable_apicv=0' makes the test pass, however, commenting out "vmx_set_interrupt_shadow()" as suggested gives a different result (with enable_apicv=1):
# ./x86_64/debug_regs ==== Test Assertion Failure ==== x86_64/debug_regs.c:179: run->exit_reason == KVM_EXIT_DEBUG && run->debug.arch.exception == DB_VECTOR && run->debug.arch.pc == target_rip && run->debug.arch.dr6 == target_dr6 pid=16352 tid=16352 errno=0 - Success 1 0x0000000000402b33: main at debug_regs.c:179 (discriminator 10) 2 0x00007f36401bd554: ?? ??:0 3 0x00000000004023a9: _start at ??:? SINGLE_STEP[1]: exit 9 exception -2147483615 rip 0x1 (should be 0x4024d9) dr6 0xffff4ff0 (should be 0xffff4ff0)
Exit 9 is KVM_EXIT_FAIL_ENTRY, which in this case VM-Entry likely failed due to invalid guest state because there was STI blocking with single-step enabled but no pending BS #DB:
Bit 14 (BS) must be 1 if the TF flag (bit 8) in the RFLAGS field is 1 and the BTF flag (bit 1) in the IA32_DEBUGCTL field is 0.
Which is precisely what that hack-a-fix avoids. There isn't really a clean solution for legacy single-step, AFAIK the only way to avoid this would be to switch KVM_GUESTDBG_SINGLESTEP to use MTF.
But that mess is a red herring, the test fails with the same signature with APICv=1 if the STI is replaced by PUSHF+BTS+POPFD (to avoid the STI shadow). We all missed this key detail from Vitaly's report:
SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0) ^^^^^^
Exit '8' is KVM_EXIT_SHUTDOWN, i.e. the arrival of the IRQ hosed the guest because the test doesn't invoke vm_init_descriptor_tables() to install event handlers. The "exception 1" shows up because the run page isn't sanitized by the test, i.e. it's stale data that happens to match.
So I would fully expect this test to fail with AVIC=1. The problem is that KVM_GUESTDBG_BLOCKIRQ does absolutely nothing to handle APICv interrupts. And even if KVM does something to fudge that behavior in the emulated local APIC, the test will then fail miserably virtual IPIs (currently AVIC only).
FWIW, the test doesn't seem to fail on my AMD EPYC system even with "avic=1" ...
I stand by my original comment that "Deviating this far from architectural behavior will end in tears at some point." Rather than try to "fix" APICv, I vote to instead either reject KVM_GUESTDBG_BLOCKIRQ if APICv=1, or log a debug message saying that KVM_GUESTDBG_BLOCKIRQ is ineffective with APICv=1.
On Tue, Nov 02, 2021, Vitaly Kuznetsov wrote:
Sean Christopherson seanjc@google.com writes:
But that mess is a red herring, the test fails with the same signature with APICv=1 if the STI is replaced by PUSHF+BTS+POPFD (to avoid the STI shadow). We all missed this key detail from Vitaly's report:
SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0) ^^^^^^
Exit '8' is KVM_EXIT_SHUTDOWN, i.e. the arrival of the IRQ hosed the guest because the test doesn't invoke vm_init_descriptor_tables() to install event handlers. The "exception 1" shows up because the run page isn't sanitized by the test, i.e. it's stale data that happens to match.
So I would fully expect this test to fail with AVIC=1. The problem is that KVM_GUESTDBG_BLOCKIRQ does absolutely nothing to handle APICv interrupts. And even if KVM does something to fudge that behavior in the emulated local APIC, the test will then fail miserably virtual IPIs (currently AVIC only).
FWIW, the test doesn't seem to fail on my AMD EPYC system even with "avic=1" ...
Huh. Assuming the IRQ is pending in the vIRR and KVM didn't screw up elsewhere, that seems like a CPU AVIC bug. #DBs have priority over IRQs, but single-step #DBs are trap-like and KVM (hopefully) isn't injecting a #DB, so a pending IRQ should be taken on the current instruction in the guest when executing VMRUN with guest.EFLAGS.IF=1,TF=1 since there will be a one-instruction delay before the single-step #DB kicks in.
On Tue, 2021-11-02 at 18:45 +0000, Sean Christopherson wrote:
On Tue, Nov 02, 2021, Vitaly Kuznetsov wrote:
Sean Christopherson seanjc@google.com writes:
But that mess is a red herring, the test fails with the same signature with APICv=1 if the STI is replaced by PUSHF+BTS+POPFD (to avoid the STI shadow). We all missed this key detail from Vitaly's report:
SINGLE_STEP[1]: exit 8 exception 1 rip 0x402a25 (should be 0x402a27) dr6 0xffff4ff0 (should be 0xffff4ff0) ^^^^^^
Exit '8' is KVM_EXIT_SHUTDOWN, i.e. the arrival of the IRQ hosed the guest because the test doesn't invoke vm_init_descriptor_tables() to install event handlers. The "exception 1" shows up because the run page isn't sanitized by the test, i.e. it's stale data that happens to match.
So I would fully expect this test to fail with AVIC=1. The problem is that KVM_GUESTDBG_BLOCKIRQ does absolutely nothing to handle APICv interrupts. And even if KVM does something to fudge that behavior in the emulated local APIC, the test will then fail miserably virtual IPIs (currently AVIC only).
FWIW, the test doesn't seem to fail on my AMD EPYC system even with "avic=1" ...
It's because AVIC is inhibited for many reasons. In this test x2apic is used, and having x2apic in CPUID inhibits AVIC.
Huh. Assuming the IRQ is pending in the vIRR and KVM didn't screw up elsewhere, that seems like a CPU AVIC bug. #DBs have priority over IRQs, but single-step #DBs are trap-like and KVM (hopefully) isn't injecting a #DB, so a pending IRQ should be taken on the current instruction in the guest when executing VMRUN with guest.EFLAGS.IF=1,TF=1 since there will be a one-instruction delay before the single-step #DB kicks in.
We could inhibit AVIC/APICv when KVM_GUESTDBG_BLOCKIRQ is in use; I'll send a patch for this soon.
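A very rough sketch of what such an inhibit could look like, assuming the existing kvm_request_apicv_update() helper and a new, hypothetical APICV_INHIBIT_REASON_BLOCKIRQ reason bit; the actual patch may well end up looking different:

	#include <linux/kvm_host.h>

	/* Sketch only: deactivate APICv/AVIC while KVM_GUESTDBG_BLOCKIRQ is
	 * set for a vCPU and re-activate it once the flag is cleared.
	 * APICV_INHIBIT_REASON_BLOCKIRQ is assumed, not an existing reason. */
	static void kvm_update_blockirq_apicv_inhibit(struct kvm_vcpu *vcpu,
						      struct kvm_guest_debug *dbg)
	{
		bool blockirq = dbg->control & KVM_GUESTDBG_BLOCKIRQ;

		kvm_request_apicv_update(vcpu->kvm, !blockirq,
					 APICV_INHIBIT_REASON_BLOCKIRQ);
	}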
Thanks a lot for finding out what is going on!
Best regards, Maxim Levitsky
On 11/08/21 14:29, Maxim Levitsky wrote:
Queued patches 1, 5 and 6.
For patches 2 and 3, please add VMX support too.
For patch 4, it's not KVM :) so please submit it separately.
Paolo
On Wed, 2021-08-11 at 15:10 +0200, Paolo Bonzini wrote:
Queued patches 1, 5 and 6.
For patches 2 and 3, please add VMX support too.
For patch 4, it's not KVM :) so please submit it separately.
Thanks!
I will do this!
Best regards, Maxim Levitsky
Paolo