The first three patches are fixes for XSA-332. They avoid WARN splats and a performance issue with interdomain events.
Patches 4 and 5 are additions to event handling in order to add per-pv-device statistics to sysfs and the ability to have a per-backend-device spurious event delay control.
Patches 6 and 7 are minor fixes I had lying around.
Juergen Gross (7):
  xen/events: reset affinity of 2-level event initially
  xen/events: don't unmask an event channel when an eoi is pending
  xen/events: fix lateeoi irq acknowledgement
  xen/events: link interdomain events to associated xenbus device
  xen/events: add per-xenbus device event statistics and settings
  xen/evtch: use smp barriers for user event ring
  xen/evtchn: read producer index only once
 drivers/block/xen-blkback/xenbus.c  |   2 +-
 drivers/net/xen-netback/interface.c |  16 ++--
 drivers/xen/events/events_2l.c      |  20 +++++
 drivers/xen/events/events_base.c    | 133 ++++++++++++++++++++++------
 drivers/xen/evtchn.c                |   6 +-
 drivers/xen/pvcalls-back.c          |   4 +-
 drivers/xen/xen-pciback/xenbus.c    |   2 +-
 drivers/xen/xen-scsiback.c          |   2 +-
 drivers/xen/xenbus/xenbus_probe.c   |  66 ++++++++++++++
 include/xen/events.h                |   7 +-
 include/xen/xenbus.h                |   7 ++
 11 files changed, 217 insertions(+), 48 deletions(-)
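[Editorial aside on patches 4 and 5: a per-device setting exported via sysfs on the xenbus device would typically be wired up roughly as below. This is an illustrative sketch only; the attribute name "spurious_threshold" and the corresponding xenbus_device field are assumptions, not taken from the series.]

/* Hypothetical sketch of a per-xenbus-device sysfs attribute; the
 * "spurious_threshold" name and the xenbus_device field are assumptions. */
static ssize_t spurious_threshold_show(struct device *dev,
                                       struct device_attribute *attr,
                                       char *buf)
{
        struct xenbus_device *xdev = to_xenbus_device(dev);

        return sprintf(buf, "%u\n", xdev->spurious_threshold);
}

static ssize_t spurious_threshold_store(struct device *dev,
                                        struct device_attribute *attr,
                                        const char *buf, size_t count)
{
        struct xenbus_device *xdev = to_xenbus_device(dev);
        unsigned int val;

        if (kstrtouint(buf, 0, &val))
                return -EINVAL;

        xdev->spurious_threshold = val;
        return count;
}
static DEVICE_ATTR_RW(spurious_threshold);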
When creating a new event channel with 2-level events the affinity needs to be reset initially in order to avoid using an old affinity from earlier usage of the event channel port.
The same applies to the affinity when onlining a vcpu: all old affinity settings for this vcpu must be reset. As percpu events get initialized before the percpu event channel hook is called, resetting of the affinities happens after offlining a vcpu instead (this works because the initial percpu memory is zeroed out).
Cc: stable@vger.kernel.org
Reported-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
 drivers/xen/events/events_2l.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
diff --git a/drivers/xen/events/events_2l.c b/drivers/xen/events/events_2l.c
index da87f3a1e351..23217940144a 100644
--- a/drivers/xen/events/events_2l.c
+++ b/drivers/xen/events/events_2l.c
@@ -47,6 +47,16 @@ static unsigned evtchn_2l_max_channels(void)
 	return EVTCHN_2L_NR_CHANNELS;
 }
 
+static int evtchn_2l_setup(evtchn_port_t evtchn)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		clear_bit(evtchn, BM(per_cpu(cpu_evtchn_mask, cpu)));
+
+	return 0;
+}
+
 static void evtchn_2l_bind_to_cpu(evtchn_port_t evtchn, unsigned int cpu,
 				  unsigned int old_cpu)
 {
@@ -355,9 +365,18 @@ static void evtchn_2l_resume(void)
 			EVTCHN_2L_NR_CHANNELS/BITS_PER_EVTCHN_WORD);
 }
 
+static int evtchn_2l_percpu_deinit(unsigned int cpu)
+{
+	memset(per_cpu(cpu_evtchn_mask, cpu), 0, sizeof(xen_ulong_t) *
+			EVTCHN_2L_NR_CHANNELS/BITS_PER_EVTCHN_WORD);
+
+	return 0;
+}
+
 static const struct evtchn_ops evtchn_ops_2l = {
 	.max_channels      = evtchn_2l_max_channels,
 	.nr_channels       = evtchn_2l_max_channels,
+	.setup             = evtchn_2l_setup,
 	.bind_to_cpu       = evtchn_2l_bind_to_cpu,
 	.clear_pending     = evtchn_2l_clear_pending,
 	.set_pending       = evtchn_2l_set_pending,
@@ -367,6 +386,7 @@ static const struct evtchn_ops evtchn_ops_2l = {
 	.unmask            = evtchn_2l_unmask,
 	.handle_events     = evtchn_2l_handle_events,
 	.resume            = evtchn_2l_resume,
+	.percpu_deinit     = evtchn_2l_percpu_deinit,
 };
 
 void __init xen_evtchn_2l_init(void)
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
When creating a new event channel with 2-level events the affinity needs to be reset initially in order to avoid using an old affinity from earlier usage of the event channel port.
The same applies to the affinity when onlining a vcpu: all old affinity settings for this vcpu must be reset. As percpu events get initialized before the percpu event channel hook is called, resetting of the affinities happens after offlining a vcpu (this is working, as initial percpu memory is zeroed out).
Cc: stable@vger.kernel.org Reported-by: Julien Grall julien@xen.org Signed-off-by: Juergen Gross jgross@suse.com
drivers/xen/events/events_2l.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+)
diff --git a/drivers/xen/events/events_2l.c b/drivers/xen/events/events_2l.c
index da87f3a1e351..23217940144a 100644
--- a/drivers/xen/events/events_2l.c
+++ b/drivers/xen/events/events_2l.c
@@ -47,6 +47,16 @@ static unsigned evtchn_2l_max_channels(void)
 	return EVTCHN_2L_NR_CHANNELS;
 }
+static int evtchn_2l_setup(evtchn_port_t evtchn)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		clear_bit(evtchn, BM(per_cpu(cpu_evtchn_mask, cpu)));
The bit corresponding to the event channel can only be set on a single CPU. Could we avoid the loop and instead clear the bit while closing the port?
Cheers,
On 06.02.21 12:20, Julien Grall wrote:
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
When creating a new event channel with 2-level events the affinity needs to be reset initially in order to avoid using an old affinity from earlier usage of the event channel port.
The same applies to the affinity when onlining a vcpu: all old affinity settings for this vcpu must be reset. As percpu events get initialized before the percpu event channel hook is called, resetting of the affinities happens after offlining a vcpu (this is working, as initial percpu memory is zeroed out).
Cc: stable@vger.kernel.org Reported-by: Julien Grall julien@xen.org Signed-off-by: Juergen Gross jgross@suse.com
drivers/xen/events/events_2l.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+)
diff --git a/drivers/xen/events/events_2l.c b/drivers/xen/events/events_2l.c index da87f3a1e351..23217940144a 100644 --- a/drivers/xen/events/events_2l.c +++ b/drivers/xen/events/events_2l.c @@ -47,6 +47,16 @@ static unsigned evtchn_2l_max_channels(void) return EVTCHN_2L_NR_CHANNELS; } +static int evtchn_2l_setup(evtchn_port_t evtchn) +{ + unsigned int cpu;
+ for_each_online_cpu(cpu) + clear_bit(evtchn, BM(per_cpu(cpu_evtchn_mask, cpu)));
The bit corresponding to the event channel can only be set on a single CPU. Could we avoid the loop and instead clear the bit while closing the port?
This would need another callback.
Juergen
On 06/02/2021 12:09, Jürgen Groß wrote:
On 06.02.21 12:20, Julien Grall wrote:
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
When creating a new event channel with 2-level events the affinity needs to be reset initially in order to avoid using an old affinity from earlier usage of the event channel port.
The same applies to the affinity when onlining a vcpu: all old affinity settings for this vcpu must be reset. As percpu events get initialized before the percpu event channel hook is called, resetting of the affinities happens after offlining a vcpu (this is working, as initial percpu memory is zeroed out).
Cc: stable@vger.kernel.org Reported-by: Julien Grall julien@xen.org Signed-off-by: Juergen Gross jgross@suse.com
drivers/xen/events/events_2l.c | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+)
diff --git a/drivers/xen/events/events_2l.c b/drivers/xen/events/events_2l.c index da87f3a1e351..23217940144a 100644 --- a/drivers/xen/events/events_2l.c +++ b/drivers/xen/events/events_2l.c @@ -47,6 +47,16 @@ static unsigned evtchn_2l_max_channels(void) return EVTCHN_2L_NR_CHANNELS; } +static int evtchn_2l_setup(evtchn_port_t evtchn) +{ + unsigned int cpu;
+ for_each_online_cpu(cpu) + clear_bit(evtchn, BM(per_cpu(cpu_evtchn_mask, cpu)));
The bit corresponding to the event channel can only be set on a single CPU. Could we avoid the loop and instead clear the bit while closing the port?
This would need another callback.
Right, this seems to be better than walking over all the CPUs every time just for cleaning one bit.
Cheers,
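[For reference, a rough sketch of the alternative discussed above: clearing the bit from a per-channel close hook instead of looping over all online CPUs. This is an editorial illustration only; the callback name and its wiring are assumptions, not from a posted patch.]

/* Editorial sketch: clear the 2-level mask bit for the CPU the port was
 * last bound to when the port is closed, instead of looping over all
 * online CPUs at setup time. The ".close" callback is hypothetical. */
static void evtchn_2l_close(evtchn_port_t evtchn, unsigned int cpu)
{
        /* Only the CPU the event was last bound to can have the bit set. */
        clear_bit(evtchn, BM(per_cpu(cpu_evtchn_mask, cpu)));
}

/* It would then be hooked into the ops table alongside the existing
 * callbacks, e.g.:
 *	.close = evtchn_2l_close,
 */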
An event channel should be kept masked when an eoi is pending for it. When being migrated to another cpu it might be unmasked, though.
In order to avoid this keep two different flags for each event channel to be able to distinguish "normal" masking/unmasking from eoi related masking/unmasking. The event channel should only be able to generate an interrupt if both flags are cleared.
Cc: stable@vger.kernel.org
Fixes: 54c9de89895e0a36047 ("xen/events: add a new late EOI evtchn framework")
Reported-by: Julien Grall <julien@xen.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
---
 drivers/xen/events/events_base.c | 63 +++++++++++++++++++++++++++-----
 1 file changed, 53 insertions(+), 10 deletions(-)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index e850f79351cb..6a836d131e73 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -97,7 +97,9 @@ struct irq_info {
 	short refcnt;
 	u8 spurious_cnt;
 	u8 is_accounted;
-	enum xen_irq_type type; /* type */
+	short type;		/* type: IRQT_* */
+	bool masked;		/* Is event explicitly masked? */
+	bool eoi_pending;	/* Is EOI pending? */
 	unsigned irq;
 	evtchn_port_t evtchn;	/* event channel */
 	unsigned short cpu;	/* cpu bound */
@@ -302,6 +304,8 @@ static int xen_irq_info_common_setup(struct irq_info *info,
 	info->irq = irq;
 	info->evtchn = evtchn;
 	info->cpu = cpu;
+	info->masked = true;
+	info->eoi_pending = false;
 
 	ret = set_evtchn_to_irq(evtchn, irq);
 	if (ret < 0)
@@ -585,7 +589,10 @@ static void xen_irq_lateeoi_locked(struct irq_info *info, bool spurious)
 	}
 
 	info->eoi_time = 0;
-	unmask_evtchn(evtchn);
+	info->eoi_pending = false;
+
+	if (!info->masked)
+		unmask_evtchn(evtchn);
 }
 
 static void xen_irq_lateeoi_worker(struct work_struct *work)
@@ -830,7 +837,11 @@ static unsigned int __startup_pirq(unsigned int irq)
 		goto err;
 
 out:
-	unmask_evtchn(evtchn);
+	info->masked = false;
+
+	if (!info->eoi_pending)
+		unmask_evtchn(evtchn);
+
 	eoi_pirq(irq_get_irq_data(irq));
 
 	return 0;
@@ -857,6 +868,7 @@ static void shutdown_pirq(struct irq_data *data)
 	if (!VALID_EVTCHN(evtchn))
 		return;
 
+	info->masked = true;
 	mask_evtchn(evtchn);
 	xen_evtchn_close(evtchn);
 	xen_irq_info_cleanup(info);
@@ -1768,18 +1780,26 @@ static int set_affinity_irq(struct irq_data *data, const struct cpumask *dest,
 
 static void enable_dynirq(struct irq_data *data)
 {
-	evtchn_port_t evtchn = evtchn_from_irq(data->irq);
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
 
-	if (VALID_EVTCHN(evtchn))
-		unmask_evtchn(evtchn);
+	if (VALID_EVTCHN(evtchn)) {
+		info->masked = false;
+
+		if (!info->eoi_pending)
+			unmask_evtchn(evtchn);
+	}
 }
 
 static void disable_dynirq(struct irq_data *data)
 {
-	evtchn_port_t evtchn = evtchn_from_irq(data->irq);
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
 
-	if (VALID_EVTCHN(evtchn))
+	if (VALID_EVTCHN(evtchn)) {
+		info->masked = true;
 		mask_evtchn(evtchn);
+	}
 }
 
 static void ack_dynirq(struct irq_data *data)
@@ -1798,6 +1818,29 @@ static void mask_ack_dynirq(struct irq_data *data)
 	ack_dynirq(data);
 }
 
+static void lateeoi_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
+
+static void lateeoi_mask_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->masked = true;
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
+
 static int retrigger_dynirq(struct irq_data *data)
 {
 	evtchn_port_t evtchn = evtchn_from_irq(data->irq);
@@ -2023,8 +2066,8 @@ static struct irq_chip xen_lateeoi_chip __read_mostly = {
 	.irq_mask		= disable_dynirq,
 	.irq_unmask		= enable_dynirq,
 
-	.irq_ack		= mask_ack_dynirq,
-	.irq_mask_ack		= mask_ack_dynirq,
+	.irq_ack		= lateeoi_ack_dynirq,
+	.irq_mask_ack		= lateeoi_mask_ack_dynirq,
 
 	.irq_set_affinity	= set_affinity_irq,
 	.irq_retrigger		= retrigger_dynirq,
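[Editorial note: the rule the two flags implement can be restated as a small helper, sketched below. This is not code from the patch, which open-codes the check at each unmask site.]

/* Editorial sketch only: the unmask condition implied by the patch above. */
static bool evtchn_may_unmask(const struct irq_info *info)
{
        return !info->masked && !info->eoi_pending;
}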
On 06.02.2021 11:49, Juergen Gross wrote:
@@ -1798,6 +1818,29 @@ static void mask_ack_dynirq(struct irq_data *data)
 	ack_dynirq(data);
 }
 
+static void lateeoi_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
+
+static void lateeoi_mask_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->masked = true;
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
+
 static int retrigger_dynirq(struct irq_data *data)
 {
 	evtchn_port_t evtchn = evtchn_from_irq(data->irq);
@@ -2023,8 +2066,8 @@ static struct irq_chip xen_lateeoi_chip __read_mostly = {
 	.irq_mask		= disable_dynirq,
 	.irq_unmask		= enable_dynirq,
 
-	.irq_ack		= mask_ack_dynirq,
-	.irq_mask_ack		= mask_ack_dynirq,
+	.irq_ack		= lateeoi_ack_dynirq,
+	.irq_mask_ack		= lateeoi_mask_ack_dynirq,
 
 	.irq_set_affinity	= set_affinity_irq,
 	.irq_retrigger		= retrigger_dynirq,
Unlike the prior handler the two new ones don't call ack_dynirq() anymore, and the description doesn't give a hint towards this difference. As a consequence, clear_evtchn() also doesn't get called anymore - patch 3 adds the calls, but claims an older commit to have been at fault. _If_ ack_dynirq() indeed isn't to be called here, shouldn't the clear_evtchn() calls get added right here?
Jan
On 08.02.21 11:06, Jan Beulich wrote:
On 06.02.2021 11:49, Juergen Gross wrote:
@@ -1798,6 +1818,29 @@ static void mask_ack_dynirq(struct irq_data *data) ack_dynirq(data); } +static void lateeoi_ack_dynirq(struct irq_data *data) +{
- struct irq_info *info = info_for_irq(data->irq);
- evtchn_port_t evtchn = info ? info->evtchn : 0;
- if (VALID_EVTCHN(evtchn)) {
info->eoi_pending = true;
mask_evtchn(evtchn);
- }
+}
+static void lateeoi_mask_ack_dynirq(struct irq_data *data) +{
- struct irq_info *info = info_for_irq(data->irq);
- evtchn_port_t evtchn = info ? info->evtchn : 0;
- if (VALID_EVTCHN(evtchn)) {
info->masked = true;
info->eoi_pending = true;
mask_evtchn(evtchn);
- }
+}
- static int retrigger_dynirq(struct irq_data *data) { evtchn_port_t evtchn = evtchn_from_irq(data->irq);
@@ -2023,8 +2066,8 @@ static struct irq_chip xen_lateeoi_chip __read_mostly = { .irq_mask = disable_dynirq, .irq_unmask = enable_dynirq,
- .irq_ack = mask_ack_dynirq,
- .irq_mask_ack = mask_ack_dynirq,
- .irq_ack = lateeoi_ack_dynirq,
- .irq_mask_ack = lateeoi_mask_ack_dynirq,
.irq_set_affinity = set_affinity_irq, .irq_retrigger = retrigger_dynirq,
Unlike the prior handler the two new ones don't call ack_dynirq() anymore, and the description doesn't give a hint towards this difference. As a consequence, clear_evtchn() also doesn't get called anymore - patch 3 adds the calls, but claims an older commit to have been at fault. _If_ ack_dynirq() indeed isn't to be called here, shouldn't the clear_evtchn() calls get added right here?
There was clearly too much time between writing this patch and looking at its performance impact. :-(
Somehow I managed to overlook that I just introduced the bug here. This OTOH explains why there are not tons of complaints with the current implementation. :-)
Will merge patch 3 into this one.
Juergen
On 2021-02-06 10:49, Juergen Gross wrote:
An event channel should be kept masked when an eoi is pending for it. When being migrated to another cpu it might be unmasked, though.
In order to avoid this keep two different flags for each event channel to be able to distinguish "normal" masking/unmasking from eoi related masking/unmasking. The event channel should only be able to generate an interrupt if both flags are cleared.
Cc: stable@vger.kernel.org Fixes: 54c9de89895e0a36047 ("xen/events: add a new late EOI evtchn framework") Reported-by: Julien Grall julien@xen.org Signed-off-by: Juergen Gross jgross@suse.com
...
+static void lateeoi_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
Doesn't this (and the one below) need a call to clear_evtchn() to actually ack the pending event? Otherwise I can't see what clears the pending bit.
I tested out this patch but processes using the userspace evtchn device did not work very well without the clear_evtchn() call.
Ross
+static void lateeoi_mask_ack_dynirq(struct irq_data *data)
+{
+	struct irq_info *info = info_for_irq(data->irq);
+	evtchn_port_t evtchn = info ? info->evtchn : 0;
+
+	if (VALID_EVTCHN(evtchn)) {
+		info->masked = true;
+		info->eoi_pending = true;
+		mask_evtchn(evtchn);
+	}
+}
When an irq has been accepted as the result of receiving an event, the related event should be cleared. The lateeoi model is missing that, resulting in a continuous stream of events being signalled.
Fixes: 54c9de89895e0a ("xen/events: add a new late EOI evtchn framework")
Cc: stable@vger.kernel.org
Signed-off-by: Juergen Gross <jgross@suse.com>
---
 drivers/xen/events/events_base.c | 2 ++
 1 file changed, 2 insertions(+)
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6a836d131e73..7b26ef817f8b 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -1826,6 +1826,7 @@ static void lateeoi_ack_dynirq(struct irq_data *data)
 	if (VALID_EVTCHN(evtchn)) {
 		info->eoi_pending = true;
 		mask_evtchn(evtchn);
+		clear_evtchn(evtchn);
 	}
 }
 
@@ -1838,6 +1839,7 @@ static void lateeoi_mask_ack_dynirq(struct irq_data *data)
 		info->masked = true;
 		info->eoi_pending = true;
 		mask_evtchn(evtchn);
+		clear_evtchn(evtchn);
 	}
 }
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
The first three patches are fixes for XSA-332. The avoid WARN splats and a performance issue with interdomain events.
Thanks for helping to figure out the problem. Unfortunately, I still reliably see the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.
I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.
After some debugging, I think I have an idea of what went wrong. The problem happens when the event is initially bound from vCPU0 to a different vCPU.
From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it being delivered on an unexpected vCPU. However, I believe the following can happen:
 vCPU0                     |  vCPU1
                           |
                           |  Call xen_rebind_evtchn_to_cpu()
 receive event X           |
                           |  mask event X
                           |  bind to vCPU1
 <vCPU descheduled>        |  unmask event X
                           |
                           |  receive event X
                           |
                           |  handle_edge_irq(X)
 handle_edge_irq(X)        |   -> handle_irq_event()
                           |    -> set IRQD_IN_PROGRESS
  -> set IRQS_PENDING      |
                           |    -> evtchn_interrupt()
                           |   -> clear IRQD_IN_PROGRESS
  -> IRQS_PENDING is set   |
  -> handle_irq_event()    |
  -> evtchn_interrupt()    |
     -> WARN()             |
                           |
All the lateeoi handlers expect a ONESHOT semantic and evtchn_interrupt() doesn't tolerate any deviation.
I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().
Aside from the handlers, I think it may impact the defer EOI mitigation, because in theory a 3rd vCPU could join the party (let's say vCPU A migrates the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
For a fix, we may want to consider holding evtchn_rwlock with write permission. Although, I am not 100% sure this is going to prevent everything.
Does my write-up make sense to you?
Cheers,
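[For readers following the scenario above: the splat comes from the ONESHOT expectation in the user event channel handler. Below is a simplified, from-memory sketch of evtchn_interrupt() in drivers/xen/evtchn.c; details may differ from the exact tree being tested.]

static irqreturn_t evtchn_interrupt(int irq, void *data)
{
        struct user_evtchn *evtchn = data;

        /* ONESHOT expectation: the port must not fire again before the
         * userspace consumer has re-enabled it. A second delivery while
         * "enabled" is still false triggers the splat being discussed. */
        WARN(!evtchn->enabled,
             "Interrupt for port %u, but apparently not enabled\n",
             evtchn->port);

        evtchn->enabled = false;

        /* ... queue the port on the ring buffer and wake up readers ... */

        return IRQ_HANDLED;
}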
On 06.02.21 19:46, Julien Grall wrote:
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
The first three patches are fixes for XSA-332. The avoid WARN splats and a performance issue with interdomain events.
Thanks for helping to figure out the problem. Unfortunately, I still see reliably the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.
I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.
After some debugging, I think I have an idea what's went wrong. The problem happens when the event is initially bound from vCPU0 to a different vCPU.
From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it being delivered on an unexpected vCPU. However, I believe the following can happen:
vCPU0 | vCPU1 | | Call xen_rebind_evtchn_to_cpu() receive event X | | mask event X | bind to vCPU1 <vCPU descheduled> | unmask event X | | receive event X | | handle_edge_irq(X) handle_edge_irq(X) | -> handle_irq_event() | -> set IRQD_IN_PROGRESS -> set IRQS_PENDING | | -> evtchn_interrupt() | -> clear IRQD_IN_PROGRESS | -> IRQS_PENDING is set | -> handle_irq_event() | -> evtchn_interrupt() | -> WARN() |
All the lateeoi handlers expect a ONESHOT semantic and evtchn_interrupt() is doesn't tolerate any deviation.
I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().
I think you picked the wrong commit for blaming, as this is just the last patch of the three patches you were testing.
Aside the handlers, I think it may impact the defer EOI mitigation because in theory if a 3rd vCPU is joining the party (let say vCPU A migrate the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
For a fix, we may want to consider to hold evtchn_rwlock with the write permission. Although, I am not 100% sure this is going to prevent everything.
It will make things worse, as it would violate the locking hierarchy (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).
At first glance I think we'll need a 3rd masking state ("temporarily masked") in the second patch in order to avoid a race with lateeoi.
In order to avoid the race you outlined above we need an "event is being handled" indicator checked via test_and_set() semantics in handle_irq_for_port() and reset only when calling clear_evtchn().
Does my write-up make sense to you?
Yes. What about my reply? ;-)
Juergen
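[Editorial illustration of the "event is being handled" idea described above. The flag, bit name and helper functions below are assumptions, not code from a posted patch; per the proposal, the check would sit in handle_irq_for_port() and the flag would be cleared together with clear_evtchn().]

/* Editorial sketch: a per-channel "being handled" flag claimed with
 * test_and_set semantics before invoking the handler and released only
 * together with clear_evtchn(). All names here are hypothetical. */
#define EVT_ACTIVE	0

static bool xen_evtchn_start_handling(struct irq_info *info)
{
        /* Returns false if another CPU is already handling this event. */
        return !test_and_set_bit(EVT_ACTIVE, &info->active_flags);
}

static void xen_evtchn_finish_handling(struct irq_info *info,
                                       evtchn_port_t port)
{
        clear_evtchn(port);                         /* ack the event ...       */
        clear_bit(EVT_ACTIVE, &info->active_flags); /* ... then allow re-entry */
}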
Hi Juergen,
On 07/02/2021 12:58, Jürgen Groß wrote:
On 06.02.21 19:46, Julien Grall wrote:
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
The first three patches are fixes for XSA-332. The avoid WARN splats and a performance issue with interdomain events.
Thanks for helping to figure out the problem. Unfortunately, I still see reliably the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.
I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.
After some debugging, I think I have an idea what's went wrong. The problem happens when the event is initially bound from vCPU0 to a different vCPU.
From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it being delivered on an unexpected vCPU. However, I believe the following can happen:
vCPU0 | vCPU1 | | Call xen_rebind_evtchn_to_cpu() receive event X | | mask event X | bind to vCPU1 <vCPU descheduled> | unmask event X | | receive event X | | handle_edge_irq(X) handle_edge_irq(X) | -> handle_irq_event() | -> set IRQD_IN_PROGRESS -> set IRQS_PENDING | | -> evtchn_interrupt() | -> clear IRQD_IN_PROGRESS | -> IRQS_PENDING is set | -> handle_irq_event() | -> evtchn_interrupt() | -> WARN() |
All the lateeoi handlers expect a ONESHOT semantic and evtchn_interrupt() is doesn't tolerate any deviation.
I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().
I think you picked the wrong commit for blaming, as this is just the last patch of the three patches you were testing.
I actually found the right commit for blaming but I copied the information from the wrong shell :/. The bug was introduced by:
c44b849cee8c ("xen/events: switch user event channels to lateeoi model")
Aside the handlers, I think it may impact the defer EOI mitigation because in theory if a 3rd vCPU is joining the party (let say vCPU A migrate the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
For a fix, we may want to consider to hold evtchn_rwlock with the write permission. Although, I am not 100% sure this is going to prevent everything.
It will make things worse, as it would violate the locking hierarchy (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).
Ah, right.
On a first glance I think we'll need a 3rd masking state ("temporarily masked") in the second patch in order to avoid a race with lateeoi.
In order to avoid the race you outlined above we need an "event is being handled" indicator checked via test_and_set() semantics in handle_irq_for_port() and reset only when calling clear_evtchn().
It feels like we are trying to work around the IRQ flow we are using (i.e. handle_edge_irq()).
This reminds me the thread we had before discovering XSA-332 (see [1]). Back then, it was suggested to switch back to handle_fasteoi_irq().
Cheers,
[1] https://lore.kernel.org/xen-devel/alpine.DEB.2.21.2004271552430.29217@sstabe...
On 08.02.21 10:11, Julien Grall wrote:
Hi Juergen,
On 07/02/2021 12:58, Jürgen Groß wrote:
On 06.02.21 19:46, Julien Grall wrote:
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
The first three patches are fixes for XSA-332. The avoid WARN splats and a performance issue with interdomain events.
Thanks for helping to figure out the problem. Unfortunately, I still see reliably the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.
I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.
After some debugging, I think I have an idea what's went wrong. The problem happens when the event is initially bound from vCPU0 to a different vCPU.
From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it being delivered on an unexpected vCPU. However, I believe the following can happen:
vCPU0 | vCPU1 | | Call xen_rebind_evtchn_to_cpu() receive event X | | mask event X | bind to vCPU1 <vCPU descheduled> | unmask event X | | receive event X | | handle_edge_irq(X) handle_edge_irq(X) | -> handle_irq_event() | -> set IRQD_IN_PROGRESS -> set IRQS_PENDING | | -> evtchn_interrupt() | -> clear IRQD_IN_PROGRESS | -> IRQS_PENDING is set | -> handle_irq_event() | -> evtchn_interrupt() | -> WARN() |
All the lateeoi handlers expect a ONESHOT semantic and evtchn_interrupt() is doesn't tolerate any deviation.
I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().
I think you picked the wrong commit for blaming, as this is just the last patch of the three patches you were testing.
I actually found the right commit for blaming but I copied the information from the wrong shell :/. The bug was introduced by:
c44b849cee8c ("xen/events: switch user event channels to lateeoi model")
Aside the handlers, I think it may impact the defer EOI mitigation because in theory if a 3rd vCPU is joining the party (let say vCPU A migrate the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
For a fix, we may want to consider to hold evtchn_rwlock with the write permission. Although, I am not 100% sure this is going to prevent everything.
It will make things worse, as it would violate the locking hierarchy (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).
Ah, right.
On a first glance I think we'll need a 3rd masking state ("temporarily masked") in the second patch in order to avoid a race with lateeoi.
In order to avoid the race you outlined above we need an "event is being handled" indicator checked via test_and_set() semantics in handle_irq_for_port() and reset only when calling clear_evtchn().
It feels like we are trying to workaround the IRQ flow we are using (i.e. handle_edge_irq()).
I'm not really sure this is the main problem here. According to your analysis the main problem is occurring when handling the event, not when handling the IRQ: the event is being received on two vcpus.
Our problem isn't due to the IRQ still being pending, but due to it being raised again, which should happen for a one-shot IRQ the same way.
But maybe I'm misunderstanding your idea.
Juergen
On 08/02/2021 09:41, Jürgen Groß wrote:
On 08.02.21 10:11, Julien Grall wrote:
Hi Juergen,
On 07/02/2021 12:58, Jürgen Groß wrote:
On 06.02.21 19:46, Julien Grall wrote:
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
The first three patches are fixes for XSA-332. The avoid WARN splats and a performance issue with interdomain events.
Thanks for helping to figure out the problem. Unfortunately, I still see reliably the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.
I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.
After some debugging, I think I have an idea what's went wrong. The problem happens when the event is initially bound from vCPU0 to a different vCPU.
From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it being delivered on an unexpected vCPU. However, I believe the following can happen:
vCPU0 | vCPU1 | | Call xen_rebind_evtchn_to_cpu() receive event X | | mask event X | bind to vCPU1 <vCPU descheduled> | unmask event X | | receive event X | | handle_edge_irq(X) handle_edge_irq(X) | -> handle_irq_event() | -> set IRQD_IN_PROGRESS -> set IRQS_PENDING | | -> evtchn_interrupt() | -> clear IRQD_IN_PROGRESS | -> IRQS_PENDING is set | -> handle_irq_event() | -> evtchn_interrupt() | -> WARN() |
All the lateeoi handlers expect a ONESHOT semantic and evtchn_interrupt() is doesn't tolerate any deviation.
I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().
I think you picked the wrong commit for blaming, as this is just the last patch of the three patches you were testing.
I actually found the right commit for blaming but I copied the information from the wrong shell :/. The bug was introduced by:
c44b849cee8c ("xen/events: switch user event channels to lateeoi model")
Aside the handlers, I think it may impact the defer EOI mitigation because in theory if a 3rd vCPU is joining the party (let say vCPU A migrate the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
For a fix, we may want to consider to hold evtchn_rwlock with the write permission. Although, I am not 100% sure this is going to prevent everything.
It will make things worse, as it would violate the locking hierarchy (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).
Ah, right.
On a first glance I think we'll need a 3rd masking state ("temporarily masked") in the second patch in order to avoid a race with lateeoi.
In order to avoid the race you outlined above we need an "event is being handled" indicator checked via test_and_set() semantics in handle_irq_for_port() and reset only when calling clear_evtchn().
It feels like we are trying to workaround the IRQ flow we are using (i.e. handle_edge_irq()).
I'm not really sure this is the main problem here. According to your analysis the main problem is occurring when handling the event, not when handling the IRQ: the event is being received on two vcpus.
I don't think we can easily divide the two because we rely on the IRQ framework to handle the lifecycle of the event. So...
Our problem isn't due to the IRQ still being pending, but due it being raised again, which should happen for a one shot IRQ the same way.
... I don't really see how the difference matters here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever else we can come up with).
Cheers,
On 08.02.21 10:54, Julien Grall wrote:
On 08/02/2021 09:41, Jürgen Groß wrote:
On 08.02.21 10:11, Julien Grall wrote:
Hi Juergen,
On 07/02/2021 12:58, Jürgen Groß wrote:
On 06.02.21 19:46, Julien Grall wrote:
Hi Juergen,
On 06/02/2021 10:49, Juergen Gross wrote:
The first three patches are fixes for XSA-332. The avoid WARN splats and a performance issue with interdomain events.
Thanks for helping to figure out the problem. Unfortunately, I still see reliably the WARN splat with the latest Linux master (1e0d27fce010) + your first 3 patches.
I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L events ABI.
After some debugging, I think I have an idea what's went wrong. The problem happens when the event is initially bound from vCPU0 to a different vCPU.
From the comment in xen_rebind_evtchn_to_cpu(), we are masking the event to prevent it being delivered on an unexpected vCPU. However, I believe the following can happen:
vCPU0 | vCPU1 | | Call xen_rebind_evtchn_to_cpu() receive event X | | mask event X | bind to vCPU1 <vCPU descheduled> | unmask event X | | receive event X | | handle_edge_irq(X) handle_edge_irq(X) | -> handle_irq_event() | -> set IRQD_IN_PROGRESS -> set IRQS_PENDING | | -> evtchn_interrupt() | -> clear IRQD_IN_PROGRESS | -> IRQS_PENDING is set | -> handle_irq_event() | -> evtchn_interrupt() | -> WARN() |
All the lateeoi handlers expect a ONESHOT semantic and evtchn_interrupt() is doesn't tolerate any deviation.
I think the problem was introduced by 7f874a0447a9 ("xen/events: fix lateeoi irq acknowledgment") because the interrupt was disabled previously. Therefore we wouldn't do another iteration in handle_edge_irq().
I think you picked the wrong commit for blaming, as this is just the last patch of the three patches you were testing.
I actually found the right commit for blaming but I copied the information from the wrong shell :/. The bug was introduced by:
c44b849cee8c ("xen/events: switch user event channels to lateeoi model")
Aside the handlers, I think it may impact the defer EOI mitigation because in theory if a 3rd vCPU is joining the party (let say vCPU A migrate the event from vCPU B to vCPU C). So info->{eoi_cpu, irq_epoch, eoi_time} could possibly get mangled?
For a fix, we may want to consider to hold evtchn_rwlock with the write permission. Although, I am not 100% sure this is going to prevent everything.
It will make things worse, as it would violate the locking hierarchy (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).
Ah, right.
On a first glance I think we'll need a 3rd masking state ("temporarily masked") in the second patch in order to avoid a race with lateeoi.
In order to avoid the race you outlined above we need an "event is being handled" indicator checked via test_and_set() semantics in handle_irq_for_port() and reset only when calling clear_evtchn().
It feels like we are trying to workaround the IRQ flow we are using (i.e. handle_edge_irq()).
I'm not really sure this is the main problem here. According to your analysis the main problem is occurring when handling the event, not when handling the IRQ: the event is being received on two vcpus.
I don't think we can easily divide the two because we rely on the IRQ framework to handle the lifecycle of the event. So...
Our problem isn't due to the IRQ still being pending, but due it being raised again, which should happen for a one shot IRQ the same way.
... I don't really see how the difference matter here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever we can come up).
The difference is that the race is occurring _before_ any IRQ is involved. So I don't see how modification of IRQ handling would help.
Juergen
Hi Juergen,
On 08/02/2021 10:22, Jürgen Groß wrote:
On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matter here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever we can come up).
The difference is that the race is occurring _before_ any IRQ is involved. So I don't see how modification of IRQ handling would help.
Roughly our current IRQ handling flow (handle_edge_irq()) looks like:
    if ( irq in progress )
    {
        set IRQS_PENDING
        return;
    }

    do
    {
        clear IRQS_PENDING
        handle_irq()
    } while (IRQS_PENDING is set)

An IRQ handling flow like handle_fasteoi_irq() looks like:

    if ( irq in progress )
        return;

    handle_irq()
The latter flow would catch "spurious" interrupts and ignore them. So it would handle nicely the race when changing the event affinity.
Cheers,
On 08.02.21 11:40, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 10:22, Jürgen Groß wrote:
On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matter here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever we can come up).
The difference is that the race is occurring _before_ any IRQ is involved. So I don't see how modification of IRQ handling would help.
Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:
if ( irq in progress ) { set IRQS_PENDING return; }
do { clear IRQS_PENDING handle_irq() } while (IRQS_PENDING is set)
IRQ handling flow like handle_fasteoi_irq() looks like:
if ( irq in progress ) return;
handle_irq()
The latter flow would catch "spurious" interrupt and ignore them. So it would handle nicely the race when changing the event affinity.
Sure? Isn't "irq in progress" being reset way before our "lateeoi" is issued, thus having the same problem again? And I think we want to keep the lateeoi behavior in order to be able to control event storms.
Juergen
On 08/02/2021 12:14, Jürgen Groß wrote:
On 08.02.21 11:40, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 10:22, Jürgen Groß wrote:
On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matter here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever we can come up).
The difference is that the race is occurring _before_ any IRQ is involved. So I don't see how modification of IRQ handling would help.
Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:
if ( irq in progress ) { set IRQS_PENDING return; }
do { clear IRQS_PENDING handle_irq() } while (IRQS_PENDING is set)
IRQ handling flow like handle_fasteoi_irq() looks like:
if ( irq in progress ) return;
handle_irq()
The latter flow would catch "spurious" interrupt and ignore them. So it would handle nicely the race when changing the event affinity.
Sure? Isn't "irq in progress" being reset way before our "lateeoi" is issued, thus having the same problem again?
Sorry I can't parse this.
And I think we want to keep
the lateeoi behavior in order to be able to control event storms.
I didn't (yet) suggest to remove lateeoi. I only suggest to use a different workflow to handle the race with vCPU affinity.
Cheers,
On 08.02.21 13:16, Julien Grall wrote:
On 08/02/2021 12:14, Jürgen Groß wrote:
On 08.02.21 11:40, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 10:22, Jürgen Groß wrote:
On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matter here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever we can come up).
The difference is that the race is occurring _before_ any IRQ is involved. So I don't see how modification of IRQ handling would help.
Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:
if ( irq in progress ) { set IRQS_PENDING return; }
do { clear IRQS_PENDING handle_irq() } while (IRQS_PENDING is set)
IRQ handling flow like handle_fasteoi_irq() looks like:
if ( irq in progress ) return;
handle_irq()
The latter flow would catch "spurious" interrupt and ignore them. So it would handle nicely the race when changing the event affinity.
Sure? Isn't "irq in progress" being reset way before our "lateeoi" is issued, thus having the same problem again?
Sorry I can't parse this.
handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is this condition being reset again in order to be able to process another IRQ? I believe this will be the case before our "lateeoi" handling becomes active (more precisely: when our IRQ handler is returning to handle_fasteoi_irq()), resulting in the possibility of the same race we are experiencing now.
Juergen
Hi Juergen,
On 08/02/2021 12:31, Jürgen Groß wrote:
On 08.02.21 13:16, Julien Grall wrote:
On 08/02/2021 12:14, Jürgen Groß wrote:
On 08.02.21 11:40, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 10:22, Jürgen Groß wrote:
On 08.02.21 10:54, Julien Grall wrote:
... I don't really see how the difference matter here. The idea is to re-use what's already existing rather than trying to re-invent the wheel with an extra lock (or whatever we can come up).
The difference is that the race is occurring _before_ any IRQ is involved. So I don't see how modification of IRQ handling would help.
Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:
if ( irq in progress ) { set IRQS_PENDING return; }
do { clear IRQS_PENDING handle_irq() } while (IRQS_PENDING is set)
IRQ handling flow like handle_fasteoi_irq() looks like:
if ( irq in progress ) return;
handle_irq()
The latter flow would catch "spurious" interrupt and ignore them. So it would handle nicely the race when changing the event affinity.
Sure? Isn't "irq in progress" being reset way before our "lateeoi" is issued, thus having the same problem again?
Sorry I can't parse this.
handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is this condition being reset again in order to be able to process another IRQ?
It is reset after the handler has been called. See handle_irq_event().
I believe this will be the case before our "lateeoi" handling is becoming active (more precise: when our IRQ handler is returning to handle_fasteoi_irq()), resulting in the possibility of the same race we are experiencing now.
I am a bit confused what you mean by "lateeoi" handling is becoming active. Can you clarify?
Note that there are other IRQ flows existing. We should have a look at them before trying to fix things ourselves.
Although, the other issue I can see so far is handle_irq_for_port() will update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. But it is not clear this is what you mean by "becoming active".
Cheers,
On 08.02.21 14:09, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 12:31, Jürgen Groß wrote:
On 08.02.21 13:16, Julien Grall wrote:
On 08/02/2021 12:14, Jürgen Groß wrote:
On 08.02.21 11:40, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 10:22, Jürgen Groß wrote:
On 08.02.21 10:54, Julien Grall wrote: > ... I don't really see how the difference matter here. The idea > is to re-use what's already existing rather than trying to > re-invent the wheel with an extra lock (or whatever we can come up).
The difference is that the race is occurring _before_ any IRQ is involved. So I don't see how modification of IRQ handling would help.
Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:
if ( irq in progress ) { set IRQS_PENDING return; }
do { clear IRQS_PENDING handle_irq() } while (IRQS_PENDING is set)
IRQ handling flow like handle_fasteoi_irq() looks like:
if ( irq in progress ) return;
handle_irq()
The latter flow would catch "spurious" interrupt and ignore them. So it would handle nicely the race when changing the event affinity.
Sure? Isn't "irq in progress" being reset way before our "lateeoi" is issued, thus having the same problem again?
Sorry I can't parse this.
handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is this condition being reset again in order to be able to process another IRQ?
It is reset after the handler has been called. See handle_irq_event().
Right. And for us this is too early, as we want the next IRQ to be handled only after we have called xen_irq_lateeoi().
I believe this will be the case before our "lateeoi" handling is becoming active (more precise: when our IRQ handler is returning to handle_fasteoi_irq()), resulting in the possibility of the same race we are experiencing now.
I am a bit confused what you mean by "lateeoi" handling is becoming active. Can you clarify?
See above: the next call of the handler should be allowed only after xen_irq_lateeoi() for the IRQ has been called.
If the handler is being called earlier we have the race resulting in the WARN() splats.
Note that are are other IRQ flows existing. We should have a look at them before trying to fix thing ourself.
Fine with me, but it either needs to fit all use cases (interdomain, IPI, real interrupts) or we need to have a per-type IRQ flow.
I think we should fix the issue locally first; then we can start planning a thorough rework. It's not as if the changes needed with the current flow would be huge, and I'd really like to have a solution sooner rather than later. Changing the IRQ flow might have other side effects which need to be excluded by thorough testing.
Although, the other issue I can see so far is handle_irq_for_port() will update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. But it is not clear this is what you mean by "becoming active".
As long as a single event can't be handled on multiple cpus at the same time, there is no locking needed.
Juergen
Hi Juergen,
On 08/02/2021 13:58, Jürgen Groß wrote:
On 08.02.21 14:09, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 12:31, Jürgen Groß wrote:
On 08.02.21 13:16, Julien Grall wrote:
On 08/02/2021 12:14, Jürgen Groß wrote:
On 08.02.21 11:40, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 10:22, Jürgen Groß wrote: > On 08.02.21 10:54, Julien Grall wrote: >> ... I don't really see how the difference matter here. The idea >> is to re-use what's already existing rather than trying to >> re-invent the wheel with an extra lock (or whatever we can come >> up). > > The difference is that the race is occurring _before_ any IRQ is > involved. So I don't see how modification of IRQ handling would > help.
Roughly our current IRQ handling flow (handle_eoi_irq()) looks like:
if ( irq in progress ) { set IRQS_PENDING return; }
do { clear IRQS_PENDING handle_irq() } while (IRQS_PENDING is set)
IRQ handling flow like handle_fasteoi_irq() looks like:
if ( irq in progress ) return;
handle_irq()
The latter flow would catch "spurious" interrupt and ignore them. So it would handle nicely the race when changing the event affinity.
Sure? Isn't "irq in progress" being reset way before our "lateeoi" is issued, thus having the same problem again?
Sorry I can't parse this.
handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is this condition being reset again in order to be able to process another IRQ?
It is reset after the handler has been called. See handle_irq_event().
Right. And for us this is too early, as we want the next IRQ being handled only after we have called xen_irq_lateeoi().
It is not really the next IRQ here. It is more a spurious IRQ because we don't clear & mask the event right away. Instead, it is done later in the handling.
I believe this will be the case before our "lateeoi" handling is becoming active (more precise: when our IRQ handler is returning to handle_fasteoi_irq()), resulting in the possibility of the same race we are experiencing now.
I am a bit confused what you mean by "lateeoi" handling is becoming active. Can you clarify?
See above: the next call of the handler should be allowed only after xen_irq_lateeoi() for the IRQ has been called.
If the handler is being called earlier we have the race resulting in the WARN() splats.
I feel it is dislike to understand race with just words. Can you provide a scenario (similar to the one I originally provided) with two vCPUs and show how this can happen?
Note that are are other IRQ flows existing. We should have a look at them before trying to fix thing ourself.
Fine with me, but it either needs to fit all use cases (interdomain, IPI, real interrupts) or we need to have a per-type IRQ flow.
AFAICT, we already use different flows based on the use cases. Before 2011, we used to use the fasteoi one, but this was changed by the following commit:
commit 7e186bdd0098b34c69fb8067c67340ae610ea499
Author: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Date:   Fri May 6 12:27:50 2011 +0100
xen: do not clear and mask evtchns in __xen_evtchn_do_upcall
Change the irq handler of evtchns and pirqs that don't need EOI (pirqs that correspond to physical edge interrupts) to handle_edge_irq.
Use handle_fasteoi_irq for pirqs that need eoi (they generally correspond to level triggered irqs), no risk in loosing interrupts because we have to EOI the irq anyway.
This change has the following benefits:
- it uses the very same handlers that Linux would use on native for the same irqs (handle_edge_irq for edge irqs and msis, and handle_fasteoi_irq for everything else);
- it uses these handlers in the same way native code would use them: it let Linux mask\unmask and ack the irq when Linux want to mask\unmask and ack the irq;
- it fixes a problem occurring when a driver calls disable_irq() in its handler: the old code was unconditionally unmasking the evtchn even if the irq is disabled when irq_eoi was called.
See Documentation/DocBook/genericirq.tmpl for more informations.
    Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
    [v1: Fixed space/tab issues]
    Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
I think we should fix the issue locally first, then we can start to do a thorough rework planning. Its not as if the needed changes with the current flow would be so huge, and I'd really like to have a solution rather sooner than later. Changing the IRQ flow might have other side effects which need to be excluded by thorough testing.
I agree that we need a solution ASAP. But I am a bit worried about:
1) Adding another lock in that event handling path.
2) Adding more complexity in the event handling (it is already fairly difficult to reason about the locking/races).
Let's see what the local fix looks like.
Although, the other issue I can see so far is handle_irq_for_port() will update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. But it is not clear this is what you mean by "becoming active".
As long as a single event can't be handled on multiple cpus at the same time, there is no locking needed.
Well, it can happen in the current code (see my original scenario). If your idea fixes it then fine.
Cheers,
On 08/02/2021 14:20, Julien Grall wrote:
I believe this will be the case before our "lateeoi" handling is becoming active (more precise: when our IRQ handler is returning to handle_fasteoi_irq()), resulting in the possibility of the same race we are experiencing now.
I am a bit confused what you mean by "lateeoi" handling is becoming active. Can you clarify?
See above: the next call of the handler should be allowed only after xen_irq_lateeoi() for the IRQ has been called.
If the handler is being called earlier we have the race resulting in the WARN() splats.
I feel it is dislike to understand race with just words. Can you provide
Sorry I meant difficult rather than dislike.
Cheers,
On 08.02.21 15:20, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 13:58, Jürgen Groß wrote:
On 08.02.21 14:09, Julien Grall wrote:
Hi Juergen,
On 08/02/2021 12:31, Jürgen Groß wrote:
On 08.02.21 13:16, Julien Grall wrote:
On 08/02/2021 12:14, Jürgen Groß wrote:
On 08.02.21 11:40, Julien Grall wrote: > Hi Juergen, > > On 08/02/2021 10:22, Jürgen Groß wrote: >> On 08.02.21 10:54, Julien Grall wrote: >>> ... I don't really see how the difference matter here. The idea >>> is to re-use what's already existing rather than trying to >>> re-invent the wheel with an extra lock (or whatever we can come >>> up). >> >> The difference is that the race is occurring _before_ any IRQ is >> involved. So I don't see how modification of IRQ handling would >> help. > > Roughly our current IRQ handling flow (handle_eoi_irq()) looks like: > > if ( irq in progress ) > { > set IRQS_PENDING > return; > } > > do > { > clear IRQS_PENDING > handle_irq() > } while (IRQS_PENDING is set) > > IRQ handling flow like handle_fasteoi_irq() looks like: > > if ( irq in progress ) > return; > > handle_irq() > > The latter flow would catch "spurious" interrupt and ignore them. > So it would handle nicely the race when changing the event affinity.
Sure? Isn't "irq in progress" being reset way before our "lateeoi" is issued, thus having the same problem again?
Sorry I can't parse this.
handle_fasteoi_irq() will do nothing "if ( irq in progress )". When is this condition being reset again in order to be able to process another IRQ?
It is reset after the handler has been called. See handle_irq_event().
Right. And for us this is too early, as we want the next IRQ being handled only after we have called xen_irq_lateeoi().
It is not really the next IRQ here. It is more a spurious IRQ because we don't clear & mask the event right away. Instead, it is done later in the handling.
I believe this will be the case before our "lateeoi" handling is becoming active (more precise: when our IRQ handler is returning to handle_fasteoi_irq()), resulting in the possibility of the same race we are experiencing now.
I am a bit confused what you mean by "lateeoi" handling is becoming active. Can you clarify?
See above: the next call of the handler should be allowed only after xen_irq_lateeoi() for the IRQ has been called.
If the handler is being called earlier we have the race resulting in the WARN() splats.
I feel it is dislike to understand race with just words. Can you provide a scenario (similar to the one I originally provided) with two vCPUs and show how this can happen?
 vCPU0                     |  vCPU1
                           |
                           |  Call xen_rebind_evtchn_to_cpu()
 receive event X           |
                           |  mask event X
                           |  bind to vCPU1
 <vCPU descheduled>        |  unmask event X
                           |
                           |  receive event X
                           |
                           |  handle_fasteoi_irq(X)
                           |   -> handle_irq_event()
                           |    -> set IRQD_IN_PROGRESS
                           |    -> evtchn_interrupt()
                           |       -> evtchn->enabled = false
                           |    -> clear IRQD_IN_PROGRESS
 handle_fasteoi_irq(X)     |
  -> evtchn_interrupt()    |
     -> WARN()             |
                           |
                           |  xen_irq_lateeoi(X)
Note that are are other IRQ flows existing. We should have a look at them before trying to fix thing ourself.
Fine with me, but it either needs to fit all use cases (interdomain, IPI, real interrupts) or we need to have a per-type IRQ flow.
AFAICT, we already used different flow based on the use cases. Before 2011, we used to use the fasteoi one but this was changed by the following commit:
Yes, I know that.
I think we should fix the issue locally first, then we can start to do a thorough rework planning. Its not as if the needed changes with the current flow would be so huge, and I'd really like to have a solution rather sooner than later. Changing the IRQ flow might have other side effects which need to be excluded by thorough testing.
I agree that we need a solution ASAP. But I am a bit worry to: 1) Add another lock in that event handling path.
Regarding complexity: it is very simple (just around masking/unmasking of the event channel). Contention is very unlikely.
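[Editorial illustration of the kind of narrow locking being discussed here; the lock member and helper names below are assumptions, not code from the thread.]

/* Editorial sketch: a per-channel lock taken only around the short
 * mask/unmask decisions, so the two flags and the actual mask state
 * cannot be observed half-updated. Names are hypothetical. */
static void evtchn_mask_locked(struct irq_info *info)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&info->mask_lock, flags);
        info->masked = true;
        mask_evtchn(info->evtchn);
        raw_spin_unlock_irqrestore(&info->mask_lock, flags);
}

static void evtchn_unmask_locked(struct irq_info *info)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&info->mask_lock, flags);
        info->masked = false;
        if (!info->eoi_pending)
                unmask_evtchn(info->evtchn);
        raw_spin_unlock_irqrestore(&info->mask_lock, flags);
}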
2) Add more complexity in the event handling (it is already fairly difficult to reason about the locking/race)
Let see what the local fix look like.
Yes.
Although, the other issue I can see so far is handle_irq_for_port() will update info->{eoi_cpu, irq_epoch, eoi_time} without any locking. But it is not clear this is what you mean by "becoming active".
As long as a single event can't be handled on multiple cpus at the same time, there is no locking needed.
Well, it can happen in the current code (see my original scenario). If your idea fix it then fine.
I hope so.
Juergen