[gem5-dev] o3 cpu: possible bug in interrupt handling

Discussion:

Nilay

2013-03-21 03:54:15 UTC

Hi

While testing some patches for the x86 architecture, I came across a
problem in which the system does nothing for several seconds. This time is
the time of the target machine. This behavior is accompanied by the
following message on the console for target machine --

hda: dma_timer_expiry: dma status == 0x64
hda: DMA interrupt recovery
hda: lost interrupt

Joel Hestness, who had seen the problem before, provided me with a patch
that solves the problem. From my conversation with Joel and after looking
at the code my self, it appears that the problem is with the fact that the
commit stage of the pipeline keeps a local copy of the interrupt object.
Since the interrupt is usually handled several cycles after the commit
stage becomes aware of it, it is possible that the local copy of the
interrupt object may not be the correct interrupt when the interrupt is
actually handled. It is possible that another interrupt occurred in the
interval between interrupt detection and interrupt handling. I am
proposing the following solution (slightly different from Joel's proposal)
to handle this problem:

diff --git a/src/cpu/o3/commit_impl.hh b/src/cpu/o3/commit_impl.hh
--- a/src/cpu/o3/commit_impl.hh
+++ b/src/cpu/o3/commit_impl.hh
@@ -753,7 +753,7 @@
}

// CPU will handle interrupt.
- cpu->processInterrupts(interrupt);
+ cpu->processInterrupts(cpu->getInterrupts());

thread[0]->noSquashFromTC = false;

The code above ignores the local copy of the interrupt object and fetches
a new one from the CPU object.

There are several questions that need to be addressed here. Is there an
actual bug in the o3 cpu? Is the diagnosis correct? Is the solution
acceptable?

Thanks
Nilay

Ali Saidi

2013-03-21 14:49:25 UTC

Permalink

Hi Nilay,

I'm a little surprised at this issue. I could see that
another higher priority interrupt comes along, but if that is the case
it should be taken after the lower priority interrupt (nearly
immediately). It could be considered a bug, but on the other hand the
CPU has to decide what interrupt it is going to take at some point and
act on it, so perhaps not.

I'm guessing something like the following
happens:

assert(low_priority);

interrupt = getInterrupt(); //
low_priority

assert(high_priority);

handle_interrupt(); // just
clears the highest priority interrupt set??

I suppose I don't have any
issues with your fix, provided there is a nice comment describing
what/why. I'd like to see if we are in fact doing the above.

Thanks,

Ali

Post by Nilay
Hi
While testing some

patches for the x86 architecture, I came across a

Post by Nilay
problem in which the

system does nothing for several seconds. This time is

Post by Nilay
the time of the

target machine. This behavior is accompanied by the

Post by Nilay
following message

on the console for target machine --

Post by Nilay
hda: dma_timer_expiry: dma

status == 0x64

Post by Nilay
hda: DMA interrupt recovery
hda: lost interrupt

Joel Hestness, who had seen the problem before, provided me with a
patch

Post by Nilay
that solves the problem. From my conversation with Joel and

after looking

Post by Nilay
at the code my self, it appears that the problem is with

the fact that the

Post by Nilay
commit stage of the pipeline keeps a local copy of

the interrupt object.

Post by Nilay
Since the interrupt is usually handled several

cycles after the commit

Post by Nilay
stage becomes aware of it, it is possible that

the local copy of the

Post by Nilay
interrupt object may not be the correct

interrupt when the interrupt is

Post by Nilay
actually handled. It is possible that

another interrupt occurred in the

Post by Nilay
interval between interrupt detection

and interrupt handling. I am

Post by Nilay
proposing the following solution

(slightly different from Joel's proposal)
diff --git a/src/cpu/o3/commit_impl.hh b/src/cpu/o3/commit_impl.hh

Post by Nilay
---

a/src/cpu/o3/commit_impl.hh

Post by Nilay
+++ b/src/cpu/o3/commit_impl.hh
@@
}
// CPU will handle interrupt.
-

cpu->processInterrupts(interrupt);

Post by Nilay
+

cpu->processInterrupts(cpu->getInterrupts());
thread[0]->noSquashFromTC = false;

Post by Nilay
The code above ignores the local

copy of the interrupt object and fetches

Post by Nilay
a new one from the CPU

object.

Post by Nilay
There are several questions that need to be addressed here.

Is there an

Post by Nilay
actual bug in the o3 cpu? Is the diagnosis correct? Is the

solution

Post by Nilay
acceptable?
Thanks
Nilay

_______________________________________________

Post by Nilay
gem5-dev mailing

list

Post by Nilay
http://m5sim.org/mailman/listinfo/gem5-dev

[1]

Links:
------
[1] http://m5sim.org/mailman/listinfo/gem5-dev

Joel Hestness

2013-03-21 16:33:56 UTC

Permalink

Hey guys,
Ali's assessment is correct, namely, that higher priority interrupts can
be dropped if the x86 interrupts device receives a low priority interrupt
followed by a high priority interrupt before either is serviced. From what
I recall, this problem is unique to x86, because it is the only
architecture gem5 supports that allows multiple concurrent interrupts.
This means that you can consider servicing interrupts out of order - in
this case, servicing the higher priority interrupt first - unlike other
architectures, which will service the interrupts in FIFO order.

@Nilay:
Your solution is definitely a step in the right direction, but I think we
should go a step farther, because this will still instantiate and drop
fault object instances with the interrupt pointer. Yes, faults are
ref-counted, but we don't even need to instantiate them: Your solution
reduces the commit stage interrupt pointer to a bool which indicates
whether there is an interrupt pending. Why not replace the pointer with a
bool, say "interruptPending", which is used to signal whether an interrupt
is pending instead of the interrupt pointer being non-NULL for all gating
in the commit pipeline? Instead of calling getInterrupts() as we do
currently and having commit hang onto a fault instance with the interrupt
pointer, the O3CPU should call a different (new) interrupts device
interface function isInterruptPending(), which returns a bool if there is
an interrupt to service. This would greatly clarify the use of these
variables in the commit stage and it would reduce the risk of future code
modifications that mishandle the interrupt pointer. The obvious downside
of this solution is that it will require updating the interrupts devices
for every architecture, which may be a fair amount of work, and testing
these changes is very non-trivial.

Thoughts?
Joel

Post by Ali Saidi
Hi Nilay,
I'm a little surprised at this issue. I could see that
another higher priority interrupt comes along, but if that is the case
it should be taken after the lower priority interrupt (nearly
immediately). It could be considered a bug, but on the other hand the
CPU has to decide what interrupt it is going to take at some point and
act on it, so perhaps not.
I'm guessing something like the following
assert(low_priority);
interrupt = getInterrupt(); //
low_priority
assert(high_priority);
handle_interrupt(); // just
clears the highest priority interrupt set??
I suppose I don't have any
issues with your fix, provided there is a nice comment describing
what/why. I'd like to see if we are in fact doing the above.
Thanks,
Ali

Post by Nilay
Hi
While testing some

patches for the x86 architecture, I came across a

Post by Nilay
problem in which the

system does nothing for several seconds. This time is

Post by Nilay
the time of the

target machine. This behavior is accompanied by the

Post by Nilay
following message

on the console for target machine --

Post by Nilay
hda: dma_timer_expiry: dma

status == 0x64

Post by Nilay
hda: DMA interrupt recovery
hda: lost interrupt

Joel Hestness, who had seen the problem before, provided me with a patch

Post by Nilay
that solves the problem. From my conversation with Joel and

after looking

Post by Nilay
at the code my self, it appears that the problem is with

the fact that the

Post by Nilay
commit stage of the pipeline keeps a local copy of

the interrupt object.

Post by Nilay
Since the interrupt is usually handled several

cycles after the commit

Post by Nilay
stage becomes aware of it, it is possible that

the local copy of the

Post by Nilay
interrupt object may not be the correct

interrupt when the interrupt is

Post by Nilay
actually handled. It is possible that

another interrupt occurred in the

Post by Nilay
interval between interrupt detection

and interrupt handling. I am

Post by Nilay
proposing the following solution

(slightly different from Joel's proposal)
diff --git a/src/cpu/o3/commit_impl.hh b/src/cpu/o3/commit_impl.hh

Post by Nilay
---

a/src/cpu/o3/commit_impl.hh

Post by Nilay
+++ b/src/cpu/o3/commit_impl.hh
@@
}
// CPU will handle interrupt.
-

cpu->processInterrupts(interrupt);

Post by Nilay
+

cpu->processInterrupts(cpu->getInterrupts());
thread[0]->noSquashFromTC = false;

Post by Nilay
The code above ignores the local

copy of the interrupt object and fetches

Post by Nilay
a new one from the CPU

object.

Post by Nilay
There are several questions that need to be addressed here.

Is there an

Post by Nilay
actual bug in the o3 cpu? Is the diagnosis correct? Is the

solution

Post by Nilay
acceptable?
Thanks
Nilay

_______________________________________________

Post by Nilay
gem5-dev mailing

list

Post by Nilay
http://m5sim.org/mailman/listinfo/gem5-dev

[1]
------
[1] http://m5sim.org/mailman/listinfo/gem5-dev
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev

--
Joel Hestness
PhD Student, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://www.cs.utexas.edu/~hestness

Nilay

2013-03-23 20:20:09 UTC

Permalink

Post by Joel Hestness
Hey guys,
Ali's assessment is correct, namely, that higher priority interrupts can
be dropped if the x86 interrupts device receives a low priority interrupt
followed by a high priority interrupt before either is serviced. From what
I recall, this problem is unique to x86, because it is the only
architecture gem5 supports that allows multiple concurrent interrupts.
This means that you can consider servicing interrupts out of order - in
this case, servicing the higher priority interrupt first - unlike other
architectures, which will service the interrupts in FIFO order.
Your solution is definitely a step in the right direction, but I think we
should go a step farther, because this will still instantiate and drop
fault object instances with the interrupt pointer. Yes, faults are
ref-counted, but we don't even need to instantiate them: Your solution
reduces the commit stage interrupt pointer to a bool which indicates
whether there is an interrupt pending. Why not replace the pointer with a
bool, say "interruptPending", which is used to signal whether an interrupt
is pending instead of the interrupt pointer being non-NULL for all gating
in the commit pipeline? Instead of calling getInterrupts() as we do
currently and having commit hang onto a fault instance with the interrupt
pointer, the O3CPU should call a different (new) interrupts device
interface function isInterruptPending(), which returns a bool if there is
an interrupt to service. This would greatly clarify the use of these
variables in the commit stage and it would reduce the risk of future code
modifications that mishandle the interrupt pointer. The obvious downside
of this solution is that it will require updating the interrupts devices
for every architecture, which may be a fair amount of work, and testing
these changes is very non-trivial.

Note that the function for checking whether or not interrupts are pending
already exists. It is called checkInterrupts(). But the suggested solution
will not work for the alpha architecture. This is because in alpha, the
functions checkInterrupts() and getInterrupts() do not always agree.

--
Nilay

Nilay

2013-03-25 21:52:53 UTC

Permalink

Ali, here is trace that shows the problem:

---------------------------------------------------------------------------
5103756467000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x30. // Here is the interrupt object receives the interrupt
5103756467000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103756496000: system.cpu.interrupts: Generated regular interrupt (48)
fault object. // Interrupt object created for the CPU
.
.
.
5103791689000: system.pc.south_bridge.ide.disks0: Posting Interrupt // IDE
controller raises the interrupt line
5103791740000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x3e. // Interrupt object receives the interrupt from the IDE
controller
5103791740000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103791744000: system.cpu.interrupts: Interrupt 62 sent to core. // CPU
asks for interrupt information to be updated i.e. some interrupt is being
handled.

---------------------------------------------------------------------------

The problem lies in the o3 cpu itself. In the following piece code from
cpu/o3/cpu.cc file, the cpu can possibly update the interrupt info for a
different interrupt and invoke the interrupt handler for a different
interrupt. This is what happened above. The interrupt object thinks that
interrupt 62 is being handled, while the CPU actually handled interrupt
48. This results in IDE controller waiting for about 20 seconds, before it
switches from DMA to PIO mode and redoes all the work.

template <class Impl>
void
FullO3CPU<Impl>::processInterrupts(Fault interrupt)
{
// Check for interrupts here. For now can copy the code that
// exists within isa_fullsys_traits.hh. Also assume that thread 0
// is the one that handles the interrupts.
// @todo: Possibly consolidate the interrupt checking code.
// @todo: Allow other threads to handle interrupts.

assert(interrupt != NoFault);
this->interrupts->updateIntrInfo(this->threadContexts[0]);

DPRINTF(O3CPU, "Interrupt %s being handled\n", interrupt->name());
this->trap(interrupt, 0, NULL);
}

--
Nilay

Ali Saidi

2013-04-15 01:37:17 UTC

Permalink

Hi Nilay,

Thanks for digging even further into this. Is Alpha the only one where checkInterrupts and getInterrupts don't agree? Perhaps we can just fix the Alpha implementation and that will solve the problem. It seems like in Alpha the checkInterrupts() function is just there to speed up execution when there isn't the possibility of an interrupt pending, but it doesn't handle all the tests. Does that work for everyone?

Thanks,
Aii

Post by Nilay
---------------------------------------------------------------------------
5103756467000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x30. // Here is the interrupt object receives the interrupt
5103756467000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103756496000: system.cpu.interrupts: Generated regular interrupt (48)
fault object. // Interrupt object created for the CPU
.
.
.
5103791689000: system.pc.south_bridge.ide.disks0: Posting Interrupt // IDE
controller raises the interrupt line
5103791740000: system.cpu.interrupts: Got Trigger Interrupt message with
vector 0x3e. // Interrupt object receives the interrupt from the IDE
controller
5103791740000: system.cpu.interrupts: Interrupt is an LowestPriority.
.
.
.
5103791744000: system.cpu.interrupts: Interrupt 62 sent to core. // CPU
asks for interrupt information to be updated i.e. some interrupt is being
handled.
---------------------------------------------------------------------------
The problem lies in the o3 cpu itself. In the following piece code from
cpu/o3/cpu.cc file, the cpu can possibly update the interrupt info for a
different interrupt and invoke the interrupt handler for a different
interrupt. This is what happened above. The interrupt object thinks that
interrupt 62 is being handled, while the CPU actually handled interrupt
48. This results in IDE controller waiting for about 20 seconds, before it
switches from DMA to PIO mode and redoes all the work.
template <class Impl>
void
FullO3CPU<Impl>::processInterrupts(Fault interrupt)
{
// Check for interrupts here. For now can copy the code that
// exists within isa_fullsys_traits.hh. Also assume that thread 0
// is the one that handles the interrupts.
assert(interrupt != NoFault);
this->interrupts->updateIntrInfo(this->threadContexts[0]);
DPRINTF(O3CPU, "Interrupt %s being handled\n", interrupt->name());
this->trap(interrupt, 0, NULL);
}
--
Nilay
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev

Nilay Vaish

2013-04-16 21:56:03 UTC

Permalink

Post by Ali Saidi
Hi Nilay,
Thanks for digging even further into this. Is Alpha the only one where
checkInterrupts and getInterrupts don't agree? Perhaps we can just fix
the Alpha implementation and that will solve the problem. It seems like
in Alpha the checkInterrupts() function is just there to speed up
execution when there isn't the possibility of an interrupt pending, but
it doesn't handle all the tests. Does that work for everyone?

A cursory glance at functions for Mips and Sparc revealed that both of
them are doing different stuff in get() and check() functions.

--
Nilay

Ali Saidi

2013-04-27 14:24:03 UTC

Permalink

Post by Ali Saidi
Hi Nilay,
Thanks for digging even further into this. Is Alpha the only one where checkInterrupts and getInterrupts don't agree? Perhaps we can just fix the Alpha implementation and that will solve the problem. It seems like in Alpha the checkInterrupts() function is just there to speed up execution when there isn't the possibility of an interrupt pending, but it doesn't handle all the tests. Does that work for everyone?

A cursory glance at functions for Mips and Sparc revealed that both of them are doing different stuff in get() and check() functions.

Ok. We probably need to have 3 functions. possibleInterrupt(), checkInterrupt() and getInterrupt(). The implementation in checkInterrupt today can move into possibleInterrupt() and the current implementation of getInterrupt() would need to be refactored for all the architectures to either fully check or check and return. Thoughts?

Ali