Discussion:
[gem5-dev] o3 cpu: possible bug in interrupt handling
Nilay
2013-03-21 03:54:15 UTC
Permalink
Hi

While testing some patches for the x86 architecture, I came across a
problem in which the system does nothing for several seconds. This time is
the time of the target machine. This behavior is accompanied by the
following message on the console for target machine --

hda: dma_timer_expiry: dma status == 0x64
hda: DMA interrupt recovery
hda: lost interrupt


Joel Hestness, who had seen the problem before, provided me with a patch
that solves the problem. From my conversation with Joel and after looking
at the code my self, it appears that the problem is with the fact that the
commit stage of the pipeline keeps a local copy of the interrupt object.
Since the interrupt is usually handled several cycles after the commit
stage becomes aware of it, it is possible that the local copy of the
interrupt object may not be the correct interrupt when the interrupt is
actually handled. It is possible that another interrupt occurred in the
interval between interrupt detection and interrupt handling. I am
proposing the following solution (slightly different from Joel's proposal)
to handle this problem:


diff --git a/src/cpu/o3/commit_impl.hh b/src/cpu/o3/commit_impl.hh
--- a/src/cpu/o3/commit_impl.hh
+++ b/src/cpu/o3/commit_impl.hh
@@ -753,7 +753,7 @@
}

// CPU will handle interrupt.
- cpu->processInterrupts(interrupt);
+ cpu->processInterrupts(cpu->getInterrupts());

thread[0]->noSquashFromTC = false;


The code above ignores the local copy of the interrupt object and fetches
a new one from the CPU object.

There are several questions that need to be addressed here. Is there an
actual bug in the o3 cpu? Is the diagnosis correct? Is the solution
acceptable?


Thanks
Nilay
Ali Saidi
2013-03-21 14:49:25 UTC
Permalink
Hi Nilay,

I'm a little surprised at this issue. I could see that
another higher priority interrupt comes along, but if that is the case
it should be taken after the lower priority interrupt (nearly
immediately). It could be considered a bug, but on the other hand the
CPU has to decide what interrupt it is going to take at some point and
act on it, so perhaps not.

I'm guessing something like the following
happens:

assert(low_priority);

interrupt = getInterrupt(); //
low_priority

assert(high_priority);

handle_interrupt(); // just
clears the highest priority interrupt set??

I suppose I don't have any
issues with your fix, provided there is a nice comment describing
what/why. I'd like to see if we are in fact doing the above.

Thanks,


Ali
Post by Nilay
Hi
While testing some
patches for the x86 architecture, I came across a
Post by Nilay
problem in which the
system does nothing for several seconds. This time is
Post by Nilay
the time of the
target machine. This behavior is accompanied by the
Post by Nilay
following message
on the console for target machine --
Post by Nilay
hda: dma_timer_expiry: dma
status == 0x64
Post by Nilay
hda: DMA interrupt recovery
hda: lost interrupt
Joel Hestness, who had seen the problem before, provided me with a
patch
Post by Nilay
that solves the problem. From my conversation with Joel and
after looking
Post by Nilay
at the code my self, it appears that the problem is with
the fact that the
Post by Nilay
commit stage of the pipeline keeps a local copy of
the interrupt object.
Post by Nilay
Since the interrupt is usually handled several
cycles after the commit
Post by Nilay
stage becomes aware of it, it is possible that
the local copy of the
Post by Nilay
interrupt object may not be the correct
interrupt when the interrupt is
Post by Nilay
actually handled. It is possible that
another interrupt occurred in the
Post by Nilay
interval between interrupt detection
and interrupt handling. I am
Post by Nilay
proposing the following solution
(slightly different from Joel's proposal)
diff --git a/src/cpu/o3/commit_impl.hh b/src/cpu/o3/commit_impl.hh
Post by Nilay
---
a/src/cpu/o3/commit_impl.hh
Post by Nilay
+++ b/src/cpu/o3/commit_impl.hh
@@
}
// CPU will handle interrupt.
-
cpu->processInterrupts(interrupt);
Post by Nilay
+
cpu->processInterrupts(cpu->getInterrupts());
thread[0]->noSquashFromTC = false;
Post by Nilay
The code above ignores the local
copy of the interrupt object and fetches
Post by Nilay
a new one from the CPU
object.
Post by Nilay
There are several questions that need to be addressed here.
Is there an
Post by Nilay
actual bug in the o3 cpu? Is the diagnosis correct? Is the
solution
Post by Nilay
acceptable?
Thanks
Nilay
_______________________________________________
Post by Nilay
gem5-dev mailing
list
Post by Nilay
http://m5sim.org/mailman/listinfo/gem5-dev
[1]



Links:
------
[1] http://m5sim.org/mailman/listinfo/gem5-dev
Joel Hestness
2013-03-21 16:33:56 UTC
Permalink
Hey guys,
Ali's assessment is correct, namely, that higher priority interrupts can
be dropped if the x86 interrupts device receives a low priority interrupt
followed by a high priority interrupt before either is serviced. From what
I recall, this problem is unique to x86, because it is the only
architecture gem5 supports that allows multiple concurrent interrupts.
This means that you can consider servicing interrupts out of order - in
this case, servicing the higher priority interrupt first - unlike other
architectures, which will service the interrupts in FIFO order.

@Nilay:
Your solution is definitely a step in the right direction, but I think we
should go a step farther, because this will still instantiate and drop
fault object instances with the interrupt pointer. Yes, faults are
ref-counted, but we don't even need to instantiate them: Your solution
reduces the commit stage interrupt pointer to a bool which indicates
whether there is an interrupt pending. Why not replace the pointer with a
bool, say "interruptPending", which is used to signal whether an interrupt
is pending instead of the interrupt pointer being non-NULL for all gating
in the commit pipeline? Instead of calling getInterrupts() as we do
currently and having commit hang onto a fault instance with the interrupt
pointer, the O3CPU should call a different (new) interrupts device
interface function isInterruptPending(), which returns a bool if there is
an interrupt to service. This would greatly clarify the use of these
variables in the commit stage and it would reduce the risk of future code
modifications that mishandle the interrupt pointer. The obvious downside
of this solution is that it will require updating the interrupts devices
for every architecture, which may be a fair amount of work, and testing
these changes is very non-trivial.

Thoughts?
Joel
Post by Ali Saidi
Hi Nilay,
I'm a little surprised at this issue. I could see that
another higher priority interrupt comes along, but if that is the case
it should be taken after the lower priority interrupt (nearly
immediately). It could be considered a bug, but on the other hand the
CPU has to decide what interrupt it is going to take at some point and
act on it, so perhaps not.
I'm guessing something like the following
assert(low_priority);
interrupt = getInterrupt(); //
low_priority
assert(high_priority);
handle_interrupt(); // just
clears the highest priority interrupt set??
I suppose I don't have any
issues with your fix, provided there is a nice comment describing
what/why. I'd like to see if we are in fact doing the above.
Thanks,
Ali
Post by Nilay
Hi
While testing some
patches for the x86 architecture, I came across a
Post by Nilay
problem in which the
system does nothing for several seconds. This time is
Post by Nilay
the time of the
target machine. This behavior is accompanied by the
Post by Nilay
following message
on the console for target machine --
Post by Nilay
hda: dma_timer_expiry: dma
status == 0x64
Post by Nilay
hda: DMA interrupt recovery
hda: lost interrupt
Joel Hestness, who had seen the problem before, provided me with a patch
Post by Nilay
that solves the problem. From my conversation with Joel and
after looking
Post by Nilay
at the code my self, it appears that the problem is with
the fact that the
Post by Nilay
commit stage of the pipeline keeps a local copy of
the interrupt object.
Post by Nilay
Since the interrupt is usually handled several
cycles after the commit
Post by Nilay
stage becomes aware of it, it is possible that
the local copy of the
Post by Nilay
interrupt object may not be the correct
interrupt when the interrupt is
Post by Nilay
actually handled. It is possible that
another interrupt occurred in the
Post by Nilay
interval between interrupt detection
and interrupt handling. I am
Post by Nilay
proposing the following solution
(slightly different from Joel's proposal)
diff --git a/src/cpu/o3/commit_impl.hh b/src/cpu/o3/commit_impl.hh
Post by Nilay
---
a/src/cpu/o3/commit_impl.hh
Post by Nilay
+++ b/src/cpu/o3/commit_impl.hh
@@
}
// CPU will handle interrupt.
-
cpu->processInterrupts(interrupt);
Post by Nilay
+
cpu->processInterrupts(cpu->getInterrupts());
thread[0]->noSquashFromTC = false;
Post by Nilay
The code above ignores the local
copy of the interrupt object and fetches
Post by Nilay
a new one from the CPU
object.
Post by Nilay
There are several questions that need to be addressed here.
Is there an
Post by Nilay
actual bug in the o3 cpu? Is the diagnosis correct? Is the
solution
Post by Nilay
acceptable?
Thanks
Nilay
_______________________________________________
Post by Nilay
gem5-dev mailing
list
Post by Nilay
http://m5sim.org/mailman/listinfo/gem5-dev
[1]
------
[1] http://m5sim.org/mailman/listinfo/gem5-dev
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://www.cs.utexas.edu/~hestness
Nilay
2013-03-23 20:20:09 UTC
Permalink
Post by Joel Hestness
Hey guys,
Ali's assessment is correct, namely, that higher priority interrupts can
be dropped if the x86 interrupts device receives a low priority interrupt
followed by a high priority interrupt before either is serviced. From what
I recall, this problem is unique to x86, because it is the only
architecture gem5 supports that allows multiple concurrent interrupts.
This means that you can consider servicing interrupts out of order - in
this case, servicing the higher priority interrupt first - unlike other
architectures, which will service the interrupts in FIFO order.
Your solution is definitely a step in the right direction, but I think we
should go a step farther, because this will still instantiate and drop
fault object instances with the interrupt pointer. Yes, faults are
ref-counted, but we don't even need to instantiate them: Your solution
reduces the commit stage interrupt pointer to a bool which indicates
whether there is an interrupt pending. Why not replace the pointer with a
bool, say "interruptPending", which is used to signal whether an interrupt
is pending instead of the interrupt pointer being non-NULL for all gating
in the commit pipeline? Instead of calling getInterrupts() as we do
currently and having commit hang onto a fault instance with the interrupt
pointer, the O3CPU should call a different (new) interrupts device
interface function isInterruptPending(), which returns a bool if there is
an interrupt to service. This would greatly clarify the use of these
variables in the commit stage and it would reduce the risk of future code
modifications that mishandle the interrupt pointer. The obvious downside
of this solution is that it will require updating the interrupts devices
for every architecture, which may be a fair amount of work, and testing
these changes is very non-trivial.
Note that the function for checking whether or not interrupts are pending
already exists. It is called checkInterrupts(). But the suggested solution
will not work for the alpha architecture. This is because in alpha, the
functions checkInterrupts() and getInterrupts() do not always agree.


--
Nilay
Nilay
2013-03-25 21:52:53 UTC
Permalink
Ali, here is trace that shows the problem:

---------------------------------------------------------------------------
5103756467000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x30. // Here is the interrupt object receives the interrupt
5103756467000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103756496000: system.cpu.interrupts: Generated regular interrupt (48)
fault object. // Interrupt object created for the CPU
.
.
.
5103791689000: system.pc.south_bridge.ide.disks0: Posting Interrupt // IDE
controller raises the interrupt line
5103791740000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x3e. // Interrupt object receives the interrupt from the IDE
controller
5103791740000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103791744000: system.cpu.interrupts: Interrupt 62 sent to core. // CPU
asks for interrupt information to be updated i.e. some interrupt is being
handled.

---------------------------------------------------------------------------

The problem lies in the o3 cpu itself. In the following piece code from
cpu/o3/cpu.cc file, the cpu can possibly update the interrupt info for a
different interrupt and invoke the interrupt handler for a different
interrupt. This is what happened above. The interrupt object thinks that
interrupt 62 is being handled, while the CPU actually handled interrupt
48. This results in IDE controller waiting for about 20 seconds, before it
switches from DMA to PIO mode and redoes all the work.

template <class Impl>
void
FullO3CPU<Impl>::processInterrupts(Fault interrupt)
{
// Check for interrupts here. For now can copy the code that
// exists within isa_fullsys_traits.hh. Also assume that thread 0
// is the one that handles the interrupts.
// @todo: Possibly consolidate the interrupt checking code.
// @todo: Allow other threads to handle interrupts.

assert(interrupt != NoFault);
this->interrupts->updateIntrInfo(this->threadContexts[0]);

DPRINTF(O3CPU, "Interrupt %s being handled\n", interrupt->name());
this->trap(interrupt, 0, NULL);
}


--
Nilay
Ali Saidi
2013-04-15 01:37:17 UTC
Permalink
Hi Nilay,

Thanks for digging even further into this. Is Alpha the only one where checkInterrupts and getInterrupts don't agree? Perhaps we can just fix the Alpha implementation and that will solve the problem. It seems like in Alpha the checkInterrupts() function is just there to speed up execution when there isn't the possibility of an interrupt pending, but it doesn't handle all the tests. Does that work for everyone?

Thanks,
Aii
Post by Nilay
---------------------------------------------------------------------------
5103756467000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x30. // Here is the interrupt object receives the interrupt
5103756467000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103756496000: system.cpu.interrupts: Generated regular interrupt (48)
fault object. // Interrupt object created for the CPU
.
.
.
5103791689000: system.pc.south_bridge.ide.disks0: Posting Interrupt // IDE
controller raises the interrupt line
5103791740000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x3e. // Interrupt object receives the interrupt from the IDE
controller
5103791740000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103791744000: system.cpu.interrupts: Interrupt 62 sent to core. // CPU
asks for interrupt information to be updated i.e. some interrupt is being
handled.
---------------------------------------------------------------------------
The problem lies in the o3 cpu itself. In the following piece code from
cpu/o3/cpu.cc file, the cpu can possibly update the interrupt info for a
different interrupt and invoke the interrupt handler for a different
interrupt. This is what happened above. The interrupt object thinks that
interrupt 62 is being handled, while the CPU actually handled interrupt
48. This results in IDE controller waiting for about 20 seconds, before it
switches from DMA to PIO mode and redoes all the work.
template <class Impl>
void
FullO3CPU<Impl>::processInterrupts(Fault interrupt)
{
// Check for interrupts here. For now can copy the code that
// exists within isa_fullsys_traits.hh. Also assume that thread 0
// is the one that handles the interrupts.
assert(interrupt != NoFault);
this->interrupts->updateIntrInfo(this->threadContexts[0]);
DPRINTF(O3CPU, "Interrupt %s being handled\n", interrupt->name());
this->trap(interrupt, 0, NULL);
}
--
Nilay
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Nilay Vaish
2013-04-16 21:56:03 UTC
Permalink
Post by Ali Saidi
Hi Nilay,
Thanks for digging even further into this. Is Alpha the only one where
checkInterrupts and getInterrupts don't agree? Perhaps we can just fix
the Alpha implementation and that will solve the problem. It seems like
in Alpha the checkInterrupts() function is just there to speed up
execution when there isn't the possibility of an interrupt pending, but
it doesn't handle all the tests. Does that work for everyone?
A cursory glance at functions for Mips and Sparc revealed that both of
them are doing different stuff in get() and check() functions.

--
Nilay
Ali Saidi
2013-04-27 14:24:03 UTC
Permalink
Post by Ali Saidi
Hi Nilay,
Thanks for digging even further into this. Is Alpha the only one where checkInterrupts and getInterrupts don't agree? Perhaps we can just fix the Alpha implementation and that will solve the problem. It seems like in Alpha the checkInterrupts() function is just there to speed up execution when there isn't the possibility of an interrupt pending, but it doesn't handle all the tests. Does that work for everyone?
A cursory glance at functions for Mips and Sparc revealed that both of them are doing different stuff in get() and check() functions.
Ok. We probably need to have 3 functions. possibleInterrupt(), checkInterrupt() and getInterrupt(). The implementation in checkInterrupt today can move into possibleInterrupt() and the current implementation of getInterrupt() would need to be refactored for all the architectures to either fully check or check and return. Thoughts?

Ali

Loading...