Discussion:
[gem5-dev] squashing bug in O3
Gabe Black
2011-11-13 23:40:20 UTC
Hey folks. Ali has had a change out for a while ("Fix several Branch
Predictor issues") which improves branch predictor performance
substantially but breaks X86_FS on O3. It turns out the problem involves
a microcoded instruction which returns from kernel to user level. The
instruction is fetched from the kernel's address space successfully and
starts to execute, dropping down to user mode along the way. Some
microops later, there's some microop control flow which O3 mispredicts.
When it squashes the mispredict and tries to restart, it first tries to
refetch the instruction involved. Since the CPU is now at user level and
the instruction is on a kernel-only page, there's a page fault and
things go downhill from there.

I partially implemented a solution to this before where O3 reinstates
the macroop it had been using when it restarts fetch. The problem here
is that the path this kind of squash takes doesn't pass back the right
information, and my attempts to fix that have been unsuccessful. The
code that handles squashing in O3 is too complex: there's too much going
in all directions, and it's not always clear what effect a change will
have in unrelated situations or which call sites are involved in a
particular type of fault.

To me, it seems like the first step in fixing this problem is to clean
up how squashes are handled in O3 so that they can be made to
consistently handle squashes in non-restartable macroops.

Without having really dug into the specifics, I think we only need two
pieces of information when squashing: a pointer to the guilty
instruction and whether execution should restart at or after it. It
would restart at it if the instruction needed to be reexecuted due to a
memory dependence violation, for instance, and would restart after it
for faults, interrupts, or branch mispredicts. Any other information
that's needed, like sequence numbers or actual control flow targets, can
be retrieved from the instruction where needed without having to split
everything out and pass the pieces around individually.
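
As a rough illustration of the two-field squash interface described
above, here is a minimal C++ sketch. Every name in it (ToyInst,
SquashRequest, Restart, lastSurvivingSeqNum, restartTarget) is
hypothetical and is not taken from gem5's O3 code; it only shows how
sequence numbers and restart targets could be derived from the
instruction instead of being passed separately.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a dynamic instruction; NOT gem5's DynInst.
struct ToyInst {
    std::uint64_t seqNum;       // program-order sequence number (assumed >= 1)
    std::uint64_t pc;           // address of the macroop this microop belongs to
    std::uint16_t microPc;      // index of this microop within the macroop
    std::uint64_t nextPc;       // resolved next macroop address
    std::uint16_t nextMicroPc;  // resolved next microop index
};

// Whether execution resumes at the guilty instruction or after it.
enum class Restart { AtInst, AfterInst };

// The only two pieces of information a squash carries in this sketch.
struct SquashRequest {
    std::shared_ptr<const ToyInst> inst;
    Restart where;
};

struct FetchTarget {
    std::uint64_t pc;
    std::uint16_t microPc;
};

// Youngest sequence number that survives the squash, derived from the
// instruction rather than passed as a separate argument.
std::uint64_t lastSurvivingSeqNum(const SquashRequest &req) {
    assert(req.inst);
    return req.where == Restart::AtInst ? req.inst->seqNum - 1
                                        : req.inst->seqNum;
}

// Where fetch should resume, also derived from the instruction.
FetchTarget restartTarget(const SquashRequest &req) {
    assert(req.inst);
    if (req.where == Restart::AtInst)
        return {req.inst->pc, req.inst->microPc};
    return {req.inst->nextPc, req.inst->nextMicroPc};
}
```

A branch mispredict would build the request with Restart::AfterInst,
while a memory dependence violation would use Restart::AtInst so that
the guilty instruction itself is reexecuted.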

Is there any obvious problem with doing things this way? I don't think
I'll personally have a lot of time to dedicate to this at the very least
in the short term, but I wanted to get the conversation going so we know
what to do when somebody has a chance to do it.

Gabe
Ali Saidi
2011-11-14 00:16:17 UTC
I think this bug is just latently in the code right now and the branch predictor change runs into it (this patch causes that branch to be mispredicted). In any case I think the issue exists today and it's just luck that it works currently.

Looking at your list, I imagine you should be able to recover most things from the dyninst, though I don't know if that is actually the case. Accepted that the squashing mechanisms should be cleaned up, I'm not sure how that is actually going to solve the problem. Don't we currently send back the instruction? With the current instructions, can't you figure out the macro-op it belongs to?

Ali
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Nilay Vaish
2011-11-14 00:20:15 UTC
Should not such an instruction be executed non-speculatively?

--
Nilay
Gabe Black
2011-11-14 00:33:02 UTC
It's not an instruction, it's a macroop that gets executed as microops.
Macroops can't be non-speculative since they aren't actually executed.
The microops can be non-speculative (and the one in question probably
is) but that doesn't help with branch misprediction later in the same
macroop.

Gabe
Gabe Black
2011-11-14 00:34:53 UTC
Yes, this is an existing bug and the branch predictor just pokes things
in the right way to expose it. The macroop isn't passed back in this
particular case, and with the code the way it is, it's difficult to even
tell that that's the case, let alone how to fix it. Cleaning things up
won't fix the problem itself, but it will make fixing the actual problem
tractable.

Gabe
Steve Reinhardt
2011-11-14 03:21:48 UTC
I'd like to understand the issue a little better before commenting on a
solution.

Gabe, when you say "instruction" in your original description, do you mean
micro-op?

It seems to me that the fundamental problem is that we're trying to undo
the effects of a non-speculative micro-op, correct? So the solution you're
pursuing is that branch mispredictions only roll back to the offending
micro-op, and don't force the entire macro-op containing that micro-op to
re-execute?

Is this predicted control flow entirely internal to the macro-op? Or is
this an RFI where we are integrating the control transfer and the privilege
change? If it is the latter, why does the RFI need to get squashed at all?

Steve
Gabe Black
2011-11-14 04:20:57 UTC
No, we're not trying to undo anything. An example might help. Let's look
at a dramatically simplified version of iret, the instruction that
returns from an interrupt handler. The microops might do the following.

1. Restore prior privilege level.
2. If we were in kernel level, skip to 4.
3. Restore user level stack.
4. End.

O3 fetches the bytes that go with iret, decodes that to a macroop, and
starts picking microops out of it. Microop 1 is executed and drops to
user level. Now microop 2 is executed, and O3 misspeculates that the
branch is taken (for example). The mispredict is detected, and later
microops in flight are squashed. O3 then attempts to restart where it
should have gone, microop 3.

Now, O3 looks at the PC involved and starts fetching the bytes which
become the macroop which the microops are pulled from. Because microop 1
successfully completed, the CPU is now at user level, but because the
iret is on a kernel page, it can't be accessed. The kernel gets a page
fault.
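
To make the failure sequence above concrete, here is a toy C++ model of
it. It is only a sketch under stated assumptions: ToyCpu, ToyPage,
fetchFaults and refetchAfterMispredictFaults are hypothetical names, and
nothing here reflects how gem5 actually models privilege or paging.

```cpp
#include <cassert>
#include <cstdint>

enum class Mode { Kernel, User };

// Hypothetical page descriptor: only the permission bit matters here.
struct ToyPage {
    std::uint64_t base;
    bool kernelOnly;
};

struct ToyCpu {
    Mode mode = Mode::Kernel;
};

// True if fetching instruction bytes from `page` would fault in `mode`.
bool fetchFaults(const ToyPage &page, Mode mode) {
    return page.kernelOnly && mode == Mode::User;
}

// Replays the simplified iret sequence and reports whether the refetch
// after the mispredicted microop 2 would take a page fault.
bool refetchAfterMispredictFaults(ToyCpu &cpu, const ToyPage &iretPage) {
    // Initial fetch happens while the CPU is still in kernel mode.
    assert(!fetchFaults(iretPage, cpu.mode));

    // Microop 1: restore the prior privilege level (drop to user).
    cpu.mode = Mode::User;

    // Microop 2 is mispredicted and younger microops are squashed.
    // Restarting at microop 3 by refetching the macroop's bytes now
    // checks permissions against the new, lower privilege level.
    return fetchFaults(iretPage, cpu.mode);
}
```

With a kernel-only page the function returns true, which corresponds to
the spurious page fault; with a user-accessible page it returns false,
which is why the bug can stay hidden.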

As I mentioned before, my partially implemented fix is to not only pass
back the PC, but to also pass back the macroop fetch should use instead
of making it refetch memory. The problem is that it's only partially
implemented, and the way squashes work in O3 makes it really tricky to
implement it properly, or to tell whether or not it's implemented
properly.
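
The idea of passing the macroop back along with the PC could look
roughly like the following C++ sketch. All names (ToyMacroop, ToySquash,
ToyFetch) are hypothetical and the fetch stage is reduced to a single
function, so treat it as an illustration of the idea rather than O3's
real interface. Microop indices are 0-based here, so index 2 is
"microop 3" in the numbered list above.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical decoded macroop: just a list of microop names.
struct ToyMacroop {
    std::vector<std::string> microops;
};
using MacroopPtr = std::shared_ptr<const ToyMacroop>;

// Squash message that carries the macroop along with the restart point.
struct ToySquash {
    std::uint64_t pc;
    std::uint16_t microPc;  // 0-based index of the microop to resume at
    MacroopPtr macroop;     // null when restarting at a macroop boundary
};

class ToyFetch {
  public:
    // `inMemory` is what decoding the bytes at the PC would produce;
    // `readable` says whether those bytes can be fetched at the
    // current privilege level.
    ToyFetch(MacroopPtr inMemory, bool readable)
        : inMemory_(inMemory), readable_(readable) {}

    // Returns the microop to resume at, or nullptr if fetch would fault.
    const std::string *restart(const ToySquash &sq) {
        MacroopPtr mop = sq.macroop;
        if (!mop) {
            // No macroop supplied: fall back to refetching from memory.
            ++memoryFetches_;
            if (!readable_)
                return nullptr;  // models the page fault
            mop = inMemory_;
        }
        if (sq.microPc >= mop->microops.size())
            return nullptr;
        return &mop->microops[sq.microPc];
    }

    int memoryFetches() const { return memoryFetches_; }

  private:
    MacroopPtr inMemory_;
    bool readable_;
    int memoryFetches_ = 0;
};
```

When the squash carries the macroop, restart() never touches memory, so
the privilege change made by an earlier microop cannot turn the restart
into a fault.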

Gabe
Steve Reinhardt
2011-11-14 05:14:14 UTC
Thanks for the more detailed explanation... that helped a lot. Sounds to
me like you're on the right track.

Steve
Post by Gabe Black
No, we're not trying to undo anything. An example might help. Lets look
at a dramatically simplified version of iret, the instruction that
returns from an interrupt handler. The microops might do the following.
1. Restore prior privilege level.
2. If we were in kernel level, skip to 4.
3. Restore user level stack.
4. End.
O3 fetches the bytes that go with iret, decodes that to a macroop, and
starts picking microops out of it. Microop 1 is executed and drops to
user level. Now microop 2 is executed, and O3 misspeculates that the
branch is taken (for example). The mispredict is detected, and later
microops in flight are squashed. O3 then attempts to restart where it
should have gone, microop 3.
Now, O3 looks at the PC involved and starts fetching the bytes which
become the macroop which the microops are pulled from. Because microop 1
successfully completed, the CPU is now at user level, but because the
iret is on a kernel page, it can't be accessed. The kernel gets a page
fault.
As I mentioned before, my partially implemented fix is to not only pass
back the PC, but to also pass back the macroop fetch should use instead
of making it refetch memory. The problem is that it's partially
implemented, and the way squashes work in O3 make it really tricky to
implement it properly, or to tell whether or not it's implemented properly.
Gabe
Post by Steve Reinhardt
I'd like to understand the issue a little better before commenting on a
solution.
Gabe, when you say "instruction" in your original description, do you
mean
Post by Steve Reinhardt
micro-op?
It seems to me that the fundamental problem is that we're trying to undo
the effects of a non-speculative micro-op, correct? So the solution
you're
Post by Steve Reinhardt
pursuing is that branch mispredictions only roll back to the offending
micro-op, and don't force the entire macro-op containing that micro-op to
re-execute?
Is this predicted control flow entirely internal to the macro-op? Or is
this an RFI where we are integrating the control transfer and the
privilege
Post by Steve Reinhardt
change? If it is the latter, why does the RFI need to get squashed at
all?
Post by Steve Reinhardt
Steve
Post by Gabe Black
Yes, this is an existing bug and the branch predictor just pokes things
in the right way to expose it. The macroop isn't passed back in this
particular case, and with the code the way it is, it's difficult to even
tell that that's the case, let alone how to fix it. Cleaning things up
won't fix the problem itself, but it will make fixing the actual problem
tractable.
Gabe
Post by Ali Saidi
I think this bug is just latently in the code right now and the branch
predictor change runs into it (this patch causes that branch to be
mispredicted). In any case I think the issue exists today and it's just
luck that it works currently.
Post by Ali Saidi
Looking at your list I imagine you should be able to recover most
things
Post by Steve Reinhardt
Post by Gabe Black
from the dyninst, however I don't know if that is actually the case.
Excepted that the squashing mechanisms should be cleaned up, I'm not
sure
Post by Steve Reinhardt
Post by Gabe Black
how that is actually going to solve the problem. Don't we currently send
back the instruction? With the current instructions can't you figure out
the macro-op it belongs to?
Post by Ali Saidi
Ali
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Nilay
2011-11-14 06:54:22 UTC
Permalink
Well, I still don't get it. Do out-of-order CPUs speculate on iret? If
iret is to be executed non-speculatively, I would expect the micro-ops
that are part of iret to be executed non-speculatively as well.

--
Nilay
Post by Steve Reinhardt
Thanks for the more detailed explanation... that helped a lot. Sounds to
me like you're on the right track.
Steve
Post by Gabe Black
No, we're not trying to undo anything. An example might help. Let's look
at a dramatically simplified version of iret, the instruction that
returns from an interrupt handler. The microops might do the following.
1. Restore prior privilege level.
2. If we were in kernel level, skip to 4.
3. Restore user level stack.
4. End.
O3 fetches the bytes that go with iret, decodes that to a macroop, and
starts picking microops out of it. Microop 1 is executed and drops to
user level. Now microop 2 is executed, and O3 misspeculates that the
branch is taken (for example). The mispredict is detected, and later
microops in flight are squashed. O3 then attempts to restart where it
should have gone, microop 3.
Now, O3 looks at the PC involved and starts fetching the bytes which
become the macroop which the microops are pulled from. Because microop 1
successfully completed, the CPU is now at user level, but because the
iret is on a kernel page, it can't be accessed. The kernel gets a page
fault.
As I mentioned before, my partially implemented fix is to not only pass
back the PC, but to also pass back the macroop fetch should use instead
of making it refetch memory. The problem is that it's partially
implemented, and the way squashes work in O3 makes it really tricky to
implement it properly, or to tell whether or not it's implemented properly.
Gabe
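
The iret walkthrough above can be made concrete with a toy model. This is only a sketch under stated assumptions, not gem5 code: the Cpu class, fetch_macroop, run_iret, the reuse_macroop flag, the page address and the microop names are all hypothetical. It shows the refetch faulting once microop 1 has dropped to user level, and the restart succeeding when the already-decoded macroop is handed back to fetch instead.

```python
from dataclasses import dataclass, field

KERNEL, USER = 0, 3  # x86-style ring numbers, used here only as labels


class PageFault(Exception):
    """Raised when user-level code touches a kernel-only page."""


@dataclass
class Cpu:
    # Hypothetical toy CPU: one privilege level and a set of kernel-only pages.
    level: int = KERNEL
    kernel_only_pages: set = field(default_factory=lambda: {0x1000})

    def fetch_macroop(self, pc):
        """Fetch and decode the bytes at pc; faults if the page is off-limits."""
        page = pc & ~0xFFF
        if self.level == USER and page in self.kernel_only_pages:
            raise PageFault(hex(pc))
        # Simplified iret microcode, following the four steps in the message.
        return ["restore_priv", "branch_if_kernel", "restore_user_stack", "end"]


def run_iret(cpu, pc, reuse_macroop):
    macroop = cpu.fetch_macroop(pc)  # fetched at kernel level: succeeds
    cpu.level = USER                 # microop 1 completes: drop to user level
    # Microop 2 is mispredicted as taken. Later microops are squashed and
    # fetch has to restart at microop 3.
    if not reuse_macroop:
        macroop = cpu.fetch_macroop(pc)  # refetch from memory at user level
    return macroop[2]                # microop 3 (index 2)
```

Calling run_iret(Cpu(), 0x1004, reuse_macroop=False) raises PageFault, which corresponds to the failure described; with reuse_macroop=True the restart picks up restore_user_stack without touching memory.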
Post by Steve Reinhardt
I'd like to understand the issue a little better before commenting on a
solution.
Gabe, when you say "instruction" in your original description, do you
mean micro-op?
It seems to me that the fundamental problem is that we're trying to undo
the effects of a non-speculative micro-op, correct? So the solution you're
pursuing is that branch mispredictions only roll back to the offending
micro-op, and don't force the entire macro-op containing that micro-op to
re-execute?
Is this predicted control flow entirely internal to the macro-op? Or is
this an RFI where we are integrating the control transfer and the privilege
change? If it is the latter, why does the RFI need to get squashed at all?
Steve
Post by Gabe Black
Yes, this is an existing bug and the branch predictor just pokes things
in the right way to expose it. The macroop isn't passed back in this
particular case, and with the code the way it is, it's difficult to even
tell that that's the case, let alone how to fix it. Cleaning things up
won't fix the problem itself, but it will make fixing the actual problem
tractable.
Gabe
Post by Ali Saidi
I think this bug is just latent in the code right now and the branch
predictor change runs into it (this patch causes that branch to be
mispredicted). In any case I think the issue exists today and it's just
luck that it works currently.
Post by Ali Saidi
Looking at your list I imagine you should be able to recover most things
from the dyninst; however, I don't know if that is actually the case.
Accepted that the squashing mechanisms should be cleaned up, I'm not sure
how that is actually going to solve the problem. Don't we currently send
back the instruction? With the current instructions can't you figure out
the macro-op it belongs to?
Ali
Steve Reinhardt
2011-11-14 15:47:12 UTC
Permalink
That would be one solution. It would have some performance cost, but
depending on how often complex non-speculative macro-instructions get
executed, it might not be too bad.

Another question is whether it makes sense to dynamically predict internal
micro-branches with the same predictor we use for macro-instruction
branches. I honestly don't know how our processors do it, but I would not
be surprised if the dynamic predictor only worked on macro-instructions,
and micro-branches had some static hint bit or something like that. That
doesn't directly affect this bug (since you would still need recovery
regardless of how you predicted the micro-branch), but this discussion does
make me wonder if our model is realistic.

Steve
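
To illustrate the alternative raised above (a dynamic predictor for macro-instruction branches, a static hint bit for micro-branches), here is a minimal sketch. It is an assumption-laden toy, not a description of gem5 or of any real processor: Branch, HybridPredictor, is_micro and static_taken are hypothetical names, and the 2-bit saturating counter table is just one common dynamic scheme picked for the example.

```python
from dataclasses import dataclass


@dataclass
class Branch:
    pc: int
    is_micro: bool = False      # branch internal to a macro-op's microcode
    static_taken: bool = False  # hint bit, only consulted for micro-branches


class HybridPredictor:
    """2-bit saturating counters for macro branches, static hints for micro."""

    def __init__(self, entries=16):
        self.counters = [1] * entries  # every entry starts weakly not-taken

    def predict(self, br):
        if br.is_micro:
            return br.static_taken     # no dynamic state is consulted
        return self.counters[br.pc % len(self.counters)] >= 2

    def update(self, br, taken):
        if br.is_micro:
            return                     # micro-branches never train the table
        i = br.pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

As noted in the message, a mispredicted micro-branch would still need recovery under either scheme; the sketch only separates how the two kinds of branch are predicted.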
Nilay Vaish
2011-11-14 23:20:21 UTC
Permalink
I checked AMD and Intel's processor manuals. Both state that iret is a
serializing instruction, which means that iret will not be executed
speculatively. I would expect even the micro-ops to be executed in a
non-speculative fashion.

--
Nilay
Gabe Black
2011-11-15 09:20:58 UTC
Permalink
Serializing and being non-speculative are not the same thing and one
doesn't imply the other. The properties of the macroop do not apply to
all the microops. There's no reason at all to make an add microop in the
iret non-speculative. The microops which update state irreversibly are
non-speculative, but being non-speculative doesn't matter here. The
microop which changes the mode wasn't misspeculated; it was supposed to
execute. In a real CPU, iret or any other instruction complicated enough
for internal control flow would probably execute out of the microcode
ROM, and then there wouldn't be any need to fetch the instruction again
either.

Gabe
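
The distinction drawn above, a serializing macroop whose microops are not all non-speculative, can be sketched as a small data model. Everything here is hypothetical and for illustration only: MicroOp, MacroOp, the non_speculative and serializing flags, and the particular microop list are assumptions, not gem5's static instruction flags or the real iret microcode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    non_speculative: bool = False  # per-microop property


@dataclass(frozen=True)
class MacroOp:
    name: str
    serializing: bool              # property of the macroop as a whole
    microops: tuple


# Hypothetical, heavily simplified iret microcode.
IRET = MacroOp("iret", serializing=True, microops=(
    MicroOp("restore_priv", non_speculative=True),  # irreversible state update
    MicroOp("branch_if_kernel"),
    MicroOp("add"),
    MicroOp("restore_user_stack"),
))


def may_execute_speculatively(uop):
    # Depends only on the microop's own flag, not on the macroop's.
    return not uop.non_speculative


speculative = [u.name for u in IRET.microops if may_execute_speculatively(u)]
```

In this model the macroop is serializing, yet only restore_priv is held back; the add and the internal branch remain free to execute speculatively, which is why the branch can be mispredicted after the mode change has already taken effect.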
Post by Nilay Vaish
I checked AMD and Intel's processor manuals. Both state that iret is a
serializing instruction, which means that iret will not be executed
speculatively. I would expect even the micro-ops are executed in a
non-speculative fashion.
--
Nilay
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Gabe Black
2011-11-15 09:45:15 UTC
Permalink
Even then, marking those microops non-speculative wouldn't fix the
problem anyway. That would make O3 wait until they were at the head of
the commit queue before executing them, but any branch could still
easily be mispredicted. The mispredicted instruction wouldn't start
executing, but that doesn't matter since a squash would still happen and
fetch would still behave badly. You'd have to eliminate the need for
branch prediction and memory dependence prediction altogether so that
you'd never have to squash part of an iret, and that would mean allowing
only one microop in flight at a time. O3 doesn't know how to do that,
and if it did it would severely impact performance.
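
The failure mode and the proposed fix can be sketched with a toy model. This is only an illustration under stated assumptions, not O3's code: ToyCpu, MacroOp, fetch_macroop and restart_after_squash are hypothetical names, the 4 KiB page size is assumed, and the four micro-ops are the simplified iret steps described earlier in this thread. It shows that a restart which refetches the macroop faults once the mode has dropped to user level, while a restart that is handed the already-decoded macroop does not.

```python
from dataclasses import dataclass

KERNEL, USER = 0, 1


class PageFault(Exception):
    """Raised when user-mode fetch touches a kernel-only page."""


@dataclass(frozen=True)
class MacroOp:
    pc: int
    microops: tuple


class ToyCpu:
    """Hypothetical stand-in for the fetch path; not gem5's O3 classes."""

    def __init__(self, kernel_only_pages):
        self.mode = KERNEL
        self.kernel_only_pages = set(kernel_only_pages)
        self.memory = {}  # pc -> MacroOp, standing in for fetch + decode

    def fetch_macroop(self, pc):
        page = pc >> 12  # assumed 4 KiB pages
        if self.mode == USER and page in self.kernel_only_pages:
            raise PageFault(hex(pc))
        return self.memory[pc]

    def restart_after_squash(self, pc, upc, macroop=None):
        # Without a macroop passed back, fetch has to go to memory again.
        if macroop is None:
            macroop = self.fetch_macroop(pc)
        return macroop.microops[upc]


IRET_PC = 0x1000
iret = MacroOp(IRET_PC, ("restore_privilege", "branch_if_kernel",
                         "restore_user_stack", "end"))
cpu = ToyCpu(kernel_only_pages={IRET_PC >> 12})
cpu.memory[IRET_PC] = iret

fetched = cpu.fetch_macroop(IRET_PC)  # fine: still in kernel mode
cpu.mode = USER                       # first micro-op completes

# The micro-branch is mispredicted; restart at the third micro-op.
try:
    cpu.restart_after_squash(IRET_PC, 2)
    refetch_faulted = False
except PageFault:
    refetch_faulted = True

resumed = cpu.restart_after_squash(IRET_PC, 2, macroop=fetched)
```

Whether the micro-ops were marked non-speculative never enters into it; the only thing that matters is whether the restart path has to touch memory again.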

Gabe
Nilay
2011-11-15 14:00:36 UTC
Permalink
Gabe, I think I now understand the issue better. I think this problem may
still occur even if the prediction was correct. Is it not possible that
only parts of the macro-op were fetched before the mode was switched, in
which case a fetch in user mode would still create a problem? I agree with
the solution that you suggested, because in your solution the macro-op
would not be fetched again.

Thanks for all the explanation!
Nilay
Steve Reinhardt
2011-11-15 18:10:53 UTC
Permalink
Glad to see that Gabe and Nilay are coming to agreement... just for
posterity, I want to clear up a few things (though they probably don't have
Post by Nilay
Is it not possible that
only parts of the macro-op were fetched before the mode was switched, in
which case fetch in user mode would still create problem?
Since the micro-ops are generated directly from the macro-op, and only the
macro-op PC goes through the TLB, you always get all the micro-ops whenever
you fetch a macro-op. So in the sense that matters here, no, you can't
fetch only part of a macro-op.
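
A rough sketch of that point, with hypothetical names (ToyFetch, translate and decode_table are not gem5 identifiers): the address translation happens once per macro-op, and decode then hands back every micro-op in one go, so there is no state in which only some of them have been fetched.

```python
class ToyFetch:
    """Toy fetch stage: one translation per macro-op, all micro-ops at once."""

    def __init__(self, decode_table):
        self.decode_table = decode_table  # address -> tuple of micro-op names
        self.translations = 0

    def translate(self, pc):
        # Stand-in for the TLB lookup; identity mapping assumed here.
        self.translations += 1
        return pc

    def fetch(self, pc):
        paddr = self.translate(pc)
        microops = self.decode_table[paddr]
        # Each micro-op is tagged with the macro-op PC and its micro-PC.
        return [(pc, upc, op) for upc, op in enumerate(microops)]


fetch = ToyFetch({0x1000: ("restore_privilege", "branch_if_kernel",
                           "restore_user_stack", "end")})
stream = fetch.fetch(0x1000)
```

After the single fetch call, stream holds all four micro-ops and fetch.translations is 1.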
Post by Gabe Black
Even then, marking those microops non-speculative wouldn't fix the
problem anyway. That would make O3 wait until they were at the head of
the commit queue before executing them, but any branch could still
easily be mispredicted.
If you didn't execute a micro-op until all the micro-ops before it were
ready to commit, then any earlier mispredicted branches would be resolved
before execution began. So I think it would fix the problem (though I
agree with Gabe that it's not a desirable fix). This would also be tricky
to implement; since micro-ops are only committed in bulk when a macro-op
completes, micro-ops in the middle of a macro-op never really reach "the
head" of the commit queue.
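
As a minimal sketch of that gating rule (a toy list standing in for the reorder buffer; may_execute and the "done" flag are made-up names, and this ignores the bulk-commit complication just mentioned): a micro-op is allowed to start only once every older entry has finished, so an unresolved older micro-branch keeps it from starting.

```python
def may_execute(rob, index):
    """True only when every entry older than rob[index] has completed."""
    return all(entry["done"] for entry in rob[:index])


rob = [
    {"op": "restore_privilege", "done": True},
    {"op": "branch_if_kernel", "done": False},  # micro-branch not yet resolved
    {"op": "restore_user_stack", "done": False},
]

before_resolve = may_execute(rob, 2)  # older branch still outstanding
rob[1]["done"] = True                 # branch resolves
after_resolve = may_execute(rob, 2)
```

Under this rule the third micro-op cannot begin until the branch ahead of it has resolved, which is why no squash would ever land in the middle of the macro-op.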
Post by Gabe Black
Serializing and being non-speculative are not the same thing and one
doesn't imply the other.
I don't agree (though it doesn't really affect this discussion).
Serialization does imply non-speculative execution, as I outlined above:
if every instruction before you has completed, then none of them could
still be in a speculative state to cause you any problems. The opposite is
not true; it is possible to have non-speculative execution that does not
imply serialization (though it's complicated, and in practice
non-speculative execution often is achieved via serialization).
Post by Gabe Black
The properties of the macroop do not apply to all the microops.
Definitely true.

Steve
