Discussion:
[gem5-dev] Data dependency caused by flags
Watanabe, Yasuko
2012-04-05 05:20:54 UTC
I ran O3 CPU in FS mode in x86 with a simple microbenchmark and got a much lower IPC than the theoretical IPC. The issue seems to be data dependencies caused by (control) flags, not registers, and I am wondering if anyone has come across the same issue.

The microbenchmark has many data-independent ADD instructions (http://repo.gem5.org/gem5/file/570b44fe6e04/src/arch/x86/isa/insts/general_purpose/arithmetic/add_and_subtract.py#l41) in a loop. On a 2-wide out-of-order machine with enough resources, the IPC should be two at steady state. However, the IPC only goes up to one. What is happening is that even though an x86 ADD has two source registers, one destination register, and a set of flags to write, gem5 adds one extra flag source register to each ADD. As a result, each ADD becomes dependent on the earlier ADD's destination flag register, constraining the achievable IPC to one.

Here is an example sequence with physical register mappings:
ADD: S1=98, S2=9, S3=2, D1=82, D2=105 (flag)
ADD: S1=92, S2=9, S3=105 (flag), D1=79, D2=90
...

Physical registers 98, 9, and 92 are ready when those two ADDs are renamed; however, as you can see, the second ADD has to wait for the first ADD because of the extra flag source register S3. When I removed those flags in the macroop definition, the IPC jumped up from 1 to 1.7.
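The effect of that extra flag source can be sketched with a toy issue model. This is only an illustration under stated assumptions: a 2-wide machine, single-cycle ADDs, unlimited other resources, and a hypothetical helper `simulate_ipc` that is not part of gem5. It leaves out everything else in the pipeline, so it is not expected to reproduce the measured 1.7.

```python
def simulate_ipc(deps, width=2, latency=1):
    """Toy issue model: deps[i] lists earlier instructions that i must wait for."""
    n = len(deps)
    done_cycle = [None] * n          # cycle at which each result becomes available
    pending = list(range(n))         # not-yet-issued instructions, oldest first
    cycle = 0
    while pending:
        slots = width                # issue slots left in this cycle
        waiting = []
        for i in pending:
            ready = all(done_cycle[p] is not None and done_cycle[p] <= cycle
                        for p in deps[i])
            if ready and slots > 0:
                done_cycle[i] = cycle + latency   # issue this cycle
                slots -= 1
            else:
                waiting.append(i)
        pending = waiting
        cycle += 1
    return n / cycle

N = 100
# Each ADD reads the flag register written by the previous ADD (the S3 case).
with_flag_source = [[] if i == 0 else [i - 1] for i in range(N)]
# Flags are write-only, so the ADDs are fully independent.
without_flag_source = [[] for _ in range(N)]

print(simulate_ipc(with_flag_source))     # 1.0
print(simulate_ipc(without_flag_source))  # 2.0
```

With the flag source in place, the ADDs form a serial chain and only one can issue per cycle regardless of the machine width.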

Does anyone know why the ADD has to read the flags, even though the x86 manual does not say that it does? Those flags should only cause a write-after-write dependency, not a read-after-write one.

Yasuko
Nilay Vaish
2012-04-05 10:34:31 UTC
The code for the function genFlags() in src/arch/x86/insts/microregop.cc
suggests that the values of flag bits not updated by the ADD instruction
need to be retained. This means that the previous values need to be read
and written again, which means the second ADD can be dependent on a value
written by the first ADD. If the dependencies were evaluated at bit level,
then these instructions would not be dependent.
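
The read-modify-write pattern described above can be sketched as follows. This is not the gem5 genFlags() code; `merge_flags` and `needs_old_value` are hypothetical helpers that only show why a flag register holding bits the instruction does not write forces a read of the old value. The bit positions are the architectural RFLAGS positions.

```python
# Architectural RFLAGS bit positions.
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
DF = 1 << 10
STATUS = CF | PF | AF | ZF | SF | OF

def merge_flags(old, computed, write_mask):
    """Keep the bits outside write_mask, take the bits inside it from computed."""
    return (old & ~write_mask) | (computed & write_mask)

def needs_old_value(register_bits, write_mask):
    """The old value matters only if the register holds bits the op does not write."""
    return (register_bits & ~write_mask) != 0

# ADD writes all six status flags, but DF lives in the same register here,
# so the old value must be read in order to preserve DF.
print(hex(merge_flags(DF | CF, ZF, STATUS)))   # 0x440 (DF kept, CF cleared, ZF set)
print(needs_old_value(STATUS | DF, STATUS))    # True
# If the register held only the bits ADD writes, no read would be needed.
print(needs_old_value(STATUS, STATUS))         # False
```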

--
Nilay
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Watanabe, Yasuko
2012-04-06 00:18:44 UTC
Nilay,

I agree with you. I think the dependencies of those flag bits should be evaluated at bit level.

Gabe and others,

This change seems invasive. Do you know the best way to handle this?

Yasuko

Gabe Black
2012-04-06 01:12:14 UTC
Yes, you guys are right. This is a recognized problem, and I've made
some changes over time which should make it easier to do this like a
real x86 CPU would. I haven't yet, but it's on the horizon. I tend to be
very busy, although circumstances may mean I have a little more or less
time than normal for a little while so I don't know for sure when I'll
get it fixed. If you have an idea of how to get it to do what you want
locally, feel free. That will get you going, and when I get it fixed for
real then you can start using that.

Gabe
Watanabe, Yasuko
2012-04-06 04:10:18 UTC
Hi Gabe,

Do you already have an idea of how to fix this? If so, can you give me some pointers?

Yasuko

Nilay Vaish
2012-04-06 04:48:40 UTC
I would be surprised if a processor actually carried out bit-level
dependency checks; that seems expensive to me. I would rather have two
separate registers:

* one that contains the status flags (OF, SF, ZF, AF, PF, CF) which I
think depend only on the current instruction under execution.

* the other one would contain the system flags which will be modified by
only very few instructions, even if most instructions read it.
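
A minimal sketch of how the two-register proposal would look at rename time, assuming ADD overwrites every status flag and therefore never needs the status register as a source. The `rename` and `depends_on_previous` helpers, the register names, and the free-list starting at 200 are all hypothetical; the sketch says nothing about whether the ISA description can express a write-only status register.

```python
def rename(instructions, initial_map, first_free):
    """Rename (srcs, dsts) pairs; return per-instruction physical srcs and dsts."""
    table = dict(initial_map)
    free = first_free
    renamed = []
    for srcs, dsts in instructions:
        phys_srcs = [table[r] for r in srcs]   # sources read the current mapping
        phys_dsts = []
        for r in dsts:
            table[r] = free                    # each destination gets a fresh register
            phys_dsts.append(free)
            free += 1
        renamed.append((phys_srcs, phys_dsts))
    return renamed

def depends_on_previous(renamed):
    """True where an instruction reads a register written by the one before it."""
    return [bool(set(cur[0]) & set(prev[1]))
            for prev, cur in zip(renamed, renamed[1:])]

# One combined flag register that every ADD both reads and writes.
single = rename(
    [(["rax", "rbx", "flags"], ["rax", "flags"]),
     (["rcx", "rbx", "flags"], ["rcx", "flags"])],
    {"rax": 98, "rbx": 9, "rcx": 92, "flags": 2}, first_free=200)

# Status flags are write-only for ADD; system flags are read but not written.
split = rename(
    [(["rax", "rbx", "system"], ["rax", "status"]),
     (["rcx", "rbx", "system"], ["rcx", "status"])],
    {"rax": 98, "rbx": 9, "rcx": 92, "status": 2, "system": 3}, first_free=200)

print(depends_on_previous(single))  # [True]
print(depends_on_previous(split))   # [False]
```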

--
Nilay
Gabe Black
2012-04-06 19:14:47 UTC
That's what we already have and isn't sufficient. Having every bit
broken out isn't what processors do, but they do break it down into a
group of condition codes and then a few individuals.
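
One plausible reading of why a single status register falls short is that some instructions write only part of it; INC, for example, updates OF, SF, ZF, AF and PF but leaves CF alone. The sketch below derives which flag registers an instruction must read and write under a given partition. The `GROUPED` partition and the `flag_operands` helper are hypothetical, chosen only for illustration, and are not taken from gem5 or from any specific processor.

```python
ALL_STATUS = frozenset({"OF", "SF", "ZF", "AF", "PF", "CF"})

ONE_REGISTER = {"status": ALL_STATUS}
GROUPED = {                                   # hypothetical split, illustration only
    "cc": frozenset({"SF", "ZF", "AF", "PF"}),
    "cf": frozenset({"CF"}),
    "of": frozenset({"OF"}),
}

def flag_operands(written, groups):
    """Return (registers_read, registers_written) for an op writing `written`."""
    reads, writes = [], []
    for name, bits in groups.items():
        touched = bits & written
        if not touched:
            continue                          # register is not written at all
        writes.append(name)
        if touched != bits:                   # partial write: old bits must be carried over
            reads.append(name)
    return reads, writes

ADD_WRITES = ALL_STATUS
INC_WRITES = ALL_STATUS - {"CF"}              # INC leaves CF unchanged

print(flag_operands(ADD_WRITES, ONE_REGISTER))  # ([], ['status'])
print(flag_operands(INC_WRITES, ONE_REGISTER))  # (['status'], ['status'])
print(flag_operands(INC_WRITES, GROUPED))       # ([], ['cc', 'of'])
```

Under the single-register layout INC has to read the old flags to preserve CF, while under the grouped layout it reads none of them.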

Gabe
Gabe Black
2012-04-06 19:17:35 UTC
It's complicated. Looking at it again I reminded myself of all the ways
it doesn't fit into the way the ISA parser does things, so it's going to
be quite a bit of work to fix properly. I don't have any ideas for how to
make it much simpler that would be at all practical.

Gabe
Watanabe, Yasuko
2012-04-06 22:43:29 UTC
Hi Gabe,

I also went through the code and got a sense of changes that need to be made. You are right. The current infrastructure makes it difficult to fix this issue.

Yasuko

Steve Reinhardt
2012-04-07 18:32:29 UTC
Hi Gabe,

Your earlier email said "I've made some changes over time which should make
it easier to do this like a real x86 CPU would". Could you expand on that?
It sounded like you had some sort of plan or direction at least. If we're
going to start working on this ourselves, it would be best if we can
benefit from whatever insights you've had or preliminary work you've done.

I see your later email says "I don't have any ideas for how to make it much
simpler", but that seems to contradict what you said at first. In
particular, you also earlier said "If you have an idea of how to get it to
do what you want locally, feel free. That will get you going, and when I
get it fixed for real then you can start using that.". I'd like to
explicitly reject that idea... for one thing, I'm not sure what a "local"
solution would look like, and more importantly, this issue seems
complicated enough that us doing some sort of temporary or stopgap solution
like you're implying, only to throw it away once you've done it "for real",
seems like a huge waste of effort. So overall I'd like to be sure we're in
sync with whatever you're thinking to make sure that our efforts are
additive and complementary and not redundant.

Thanks,

Steve
Gabe Black
2012-04-08 01:19:14 UTC
Permalink
Yeah, I think we've talked about this topic in the past, but it was a
while ago and I don't remember exactly what all we talked about or the
conclusion(s) we reached.

The problem at the ISA level is that there are lots of instructions in
x86 which are pretty basic and used a lot (adds, subtracts, etc.) which
compute condition codes every time in case you need them. That, combined
with the fact that the instructions which update the condition codes
update somewhat erratic combinations of bits, means that lots of
instructions write the condition code bits, and those same common
instructions read them too so they can do a partial update.
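
A minimal sketch of that partial update, in Python rather than gem5's C++. This is not the genFlags() code; the bit positions are the architectural RFLAGS positions, and the names merge_flags and needs_old_flags are made up for illustration.

```python
# Architectural bit positions of the x86 status flags in RFLAGS.
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
ALL_FLAGS = CF | PF | AF | ZF | SF | OF


def merge_flags(old_flags, computed, write_mask):
    """Keep the bits outside write_mask and replace the bits inside it."""
    return (old_flags & ~write_mask) | (computed & write_mask)


def needs_old_flags(write_mask, tracked=ALL_FLAGS):
    """The old value matters only if some tracked bit is left unwritten.

    `tracked` is the set of bits kept in the register; a register that
    tracks more bits than the instruction writes still needs a read.
    """
    return (tracked & ~write_mask) != 0


# INC writes every status flag except CF, so the old CF must be carried over.
inc_mask = ALL_FLAGS & ~CF
# ADD writes all six status flags, so the old value cannot affect the result.
add_mask = ALL_FLAGS
```

With these masks, needs_old_flags(inc_mask) is true and needs_old_flags(add_mask) is false, which is the read-after-write versus write-after-write distinction raised at the start of the thread.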

This has happened to a lesser extent before, where there are control-like
bits and condition-code-like bits in the same register. To my knowledge
that's happened at least on SPARC, ARM, and x86. That's dealt with by
splitting the condition code bits out into their own register, which is
treated as a renamed integer register, while the control bits are
treated as a misc reg with all the overhead and special precautions.
That doesn't entirely work on x86, though, because even among the
condition code bits there are a lot of partial accesses as described
above. The cc bits could be broken down into individual bits, but that's
pretty cumbersome since there are, including the two artificial ones for
microcode, 8 of them I believe? That would be a lot of registers to
rename, would slow down the simulator, wouldn't be that realistic, etc.
What real CPUs do, after talking to someone in the know at AMD, is that
they gather up one group of flags, about 4 if I recall, and treat those
as a unit. The others are handled individually. The group of 4 is still
not 100% treated as a unit since some instructions modify just one of
them, for instance, but it's pretty close, optimizes for the common
case, and the odd cases can still work like they do today.

The difficulty implementing this is that exactly which condition code
bits to set and which to check for conditional microops are decided at
the microcode level and are arbitrary combinations. They don't need to
be completely arbitrary, but that means that microops really effectively
know which, how many, etc., condition code registers they need at
construction time as opposed to compile time. So what we'd need to do is
to allow the constructor for a microop to look at the flags it was being
given and to use that to more programmatically figure out which registers
it had as sources or destinations, and how many. The body of the
instructions themselves would need to be sophisticated enough to pull
together the different source registers, whatever they are, and to
process them appropriately with a consistent bit of code (and not 18
different parameters to some function where 14 aren't used at any
particular time). It would also have to know how to split things back up
again when writing out the results.
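
As a rough illustration of a constructor that works out its flag operands from its arguments, here is a small Python sketch. It assumes one hypothetical register per flag (the cc_* names are invented), so it sidesteps grouping entirely; the MicroOp class is illustrative and is not gem5's microop class hierarchy.

```python
# Hypothetical register names, one per status flag.
FLAG_REGS = {f: "cc_" + f.lower() for f in ("CF", "PF", "AF", "ZF", "SF", "OF")}


class MicroOp:
    def __init__(self, mnemonic, set_flags=(), check_flags=()):
        self.mnemonic = mnemonic
        # The operand lists are derived from the flag arguments when the
        # object is built, instead of being fixed per microop class.
        self.src_regs = sorted(FLAG_REGS[f] for f in check_flags)
        self.dest_regs = sorted(FLAG_REGS[f] for f in set_flags)


all_six = ("CF", "PF", "AF", "ZF", "SF", "OF")
add = MicroOp("add", set_flags=all_six)
adc = MicroOp("adc", set_flags=all_six, check_flags=("CF",))
```

Here add ends up with no flag sources and six flag destinations, while adc additionally reads cc_cf. The same class serves both, which is the point of deciding operands at construction time.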

What I did to move us a little bit in this direction is to make the
types of operands much more flexible so that we can have structures,
typedefs, etc. What we'd still need is truly composite operand types
where a single operand, for instance the condition code bits, is built
from a set of registers (determined in some way appropriate to the
operand) and/or written back to a set of registers, but which could be
handled easily as a single value inside the code blob. Then we can avoid
having hundreds of versions of microops for all the different combinations
of flag bits, which would be a terrible thing to have to live with.
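
One way to picture such a composite operand is a gather/scatter pair, sketched below in Python. The CompositeFlags class, the register names, and the particular layout are all assumptions for illustration; nothing here reflects the ISA parser's actual operand types.

```python
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11


class CompositeFlags:
    """One logical operand backed by several hypothetical registers."""

    def __init__(self, layout):
        # layout maps a register name to the bit mask that register holds.
        self.layout = dict(layout)

    def gather(self, regfile, wanted):
        """Build a single value from every register overlapping `wanted`."""
        value = 0
        for reg, mask in self.layout.items():
            if mask & wanted:
                value |= regfile[reg] & mask
        return value & wanted

    def scatter(self, regfile, value, written):
        """Write the bits in `written` back to the registers that own them."""
        for reg, mask in self.layout.items():
            touched = mask & written
            if touched:
                regfile[reg] = (regfile[reg] & ~touched) | (value & touched)


# Assumed layout: one grouped register plus two single-flag registers.
cc = CompositeFlags({"cc_group": ZF | AF | PF | SF, "cc_cf": CF, "cc_of": OF})
regfile = {"cc_group": ZF, "cc_cf": CF, "cc_of": 0}
seen = cc.gather(regfile, CF | ZF)   # value the code blob would see
cc.scatter(regfile, OF, OF | CF)     # sets OF, clears CF, leaves cc_group
```

The code blob only ever handles the single gathered value; which registers are read or written is decided by the masks passed in.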

As far as easier ways to deal with this, there is only one which is what
I was alluding to in what I think was my earliest email, and that's to
just hack around it so the instructions you know you're using in the
performance sensitive part behave incorrectly generally speaking, but do
what you expect for the benchmark. Maybe they'd even have to know where
they were running from, that they were in a range of ticks, etc. A gross
and terrible hack unfit to check in, but something that would get the
poster unstuck for now. Doing things the "right" way will take some
infrastructure work, and that may not be very quick. I don't think
there's any real shortcut around doing the infrastructure work that
doesn't have a pretty heavy cost (like blowing up the number of microop
classes 100-fold).

Gabe
Nilay Vaish
2012-04-08 05:38:12 UTC
Permalink
It seems to me that if we handle the following cases, most instructions
would be covered:
1. All six condition codes: (OF, SF, ZF, AF, PF, CF)
2. Two classes of five condition codes: (OF, SF, ZF, PF, CF) and
(OF, SF, ZF, AF, PF)
3. One class of two condition codes: (OF, CF)

Yasuko's current problem with ADD instructions will be solved if we just
handle the first case, i.e., specify that if an instruction writes all
the condition codes, the condition code register is not assumed to be a
source register.
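
The first case, and its effect on the microbenchmark, can be sketched with a toy issue model in Python. Everything below is an idealized assumption (single-cycle ADDs, in-order issue, no other resource limits); it is not the O3 CPU model, and the function names are invented.

```python
ALL_CC = frozenset({"OF", "SF", "ZF", "AF", "PF", "CF"})


def cc_is_source(written):
    """Case 1: a microop that writes every condition code need not read them."""
    return not ALL_CC <= frozenset(written)


def cycles_to_issue(n_ops, width, flag_is_source):
    """Cycles needed to issue n_ops single-cycle ADDs whose only possible
    dependency is the flag value written by the previous ADD."""
    ready_at = 0               # cycle at which the newest flag value is usable
    cycle, issued_this_cycle = 0, 0
    for _ in range(n_ops):
        earliest = ready_at if flag_is_source else 0
        if earliest > cycle or issued_this_cycle == width:
            cycle = max(cycle + 1, earliest)
            issued_this_cycle = 0
        issued_this_cycle += 1
        ready_at = cycle + 1   # result is usable one cycle after issue
    return cycle + 1


n = 1000
ipc_with_read = n / cycles_to_issue(n, 2, True)
ipc_without_read = n / cycles_to_issue(n, 2, cc_is_source(ALL_CC))
```

In this model the chained version issues one ADD per cycle and the unchained version two per cycle on a 2-wide machine. The measured jump reported earlier in the thread was from 1 to 1.7, so the model only shows the direction of the effect, not its exact size.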

--
Nilay
Gabe Black
2012-04-08 08:11:44 UTC
Permalink
I don't think that's true, although I'm not willing to trawl through the
ISA manual to determine one way or the other. You'll need to just trust
me, since if AMD once implemented chips that way they intended to sell,
I doubt they just picked something from a hat. The group of four which
are considered together are, I think, ZAPS: zero, auxiliary carry (? it's
been a while), parity, and sign, leaving overflow, carry, and the
artificial emulation zero and emulation carry flags. That's five groups,
which is reasonable as far as having registers to rename, etc., but
having a full cross product of combinations would still be 64 variants.
I did look through the entire ISA manual's instruction listing once a
couple of years ago when I was researching this, and I think there are
only a small handful of instructions which behave badly with these
groups and, say, write to only the zero flag.
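
The grouping described above can be sketched as follows in Python. The group names and the EZF/ECF labels for the two emulation flags are shorthand chosen here, and the rule that a partially written group must also be read is an assumption drawn from the partial-update discussion earlier in the thread.

```python
# Five rename groups: ZAPS together, the rest individually.
GROUPS = {
    "zaps": frozenset({"ZF", "AF", "PF", "SF"}),
    "of": frozenset({"OF"}),
    "cf": frozenset({"CF"}),
    "ezf": frozenset({"EZF"}),
    "ecf": frozenset({"ECF"}),
}


def classify(written):
    """Return (source groups, destination groups) for a set of written flags."""
    written = frozenset(written)
    srcs, dests = [], []
    for name, members in GROUPS.items():
        hit = members & written
        if not hit:
            continue
        dests.append(name)
        if hit != members:
            # Partial write: the untouched bits of the group must be merged in.
            srcs.append(name)
    return srcs, dests


add_like = classify({"OF", "SF", "ZF", "AF", "PF", "CF"})
inc_like = classify({"OF", "SF", "ZF", "AF", "PF"})
zf_only = classify({"ZF"})
```

Under these assumptions an ADD-like write set and an INC-like write set both have no flag sources, while an instruction that writes only the zero flag still has to read the ZAPS group, which is the "behaves badly" case.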

Gabe
Post by Nilay Vaish
It seems to me if we can cover the following cases most of the
instructions would get covered --
1. All six condition codes -- (OF, SF, ZF, AF, PF, CF)
2. Two classes of five condition codes -- (OF,SF,ZF,PF,CF), (OF, SF,
ZF, AF, PF)
3. One class of two condition codes -- (OF,CF)
Yasuko's current problem about ADD instructions will get solved if we
just handle the first case i.e. specify that if an instruction is
writing all the condition codes, then do not assume condition code
register to be a source register.
--
Nilay
Post by Gabe Black
Yeah, I think we've talked about this topic in the past, but it was a
while ago and I don't remember exactly what all we talked about or the
conclusion(s) we reached.
The problem at the ISA level is that there are lots of instructions in
x86 which are pretty basic and used a lot (adds, subtracts, etc.) which
compute condition codes every time in case you need them. That, combined
with the fact that the instructions which update the condition codes
update somewhat erratic combinations of bits, means that lots of
instructions write the condition code bits, and those same common
instructions read them too so they can do a partial update.
This has happened to a lesser extent before where there are control like
bits and condition code like bits in the same register. To my knowledge
that's happened at least on SPARC, ARM, and x86. That's dealt with by
splitting the condition code bits out into their own register, which is
treated as a renamed integer register, and the control bits which are
treated as a misc reg with all the overhead and special precautions.
That doesn't entirely work on x86, though, because even among the
condition code bits there are a lot of partial accesses as described
above. The cc bits could be broken down into individual bits, but that's
pretty cumbersome since there are, including the two artificial ones for
microcode, 8 of them I believe? That would be a lot of registers to
rename, would slow down the simulator, wouldn't be that realistic, etc.
What real CPUs do, after talking to someone in the know at AMD, is that
they gather up one group of flags, about 4 if I recall, and treat those
as a unit. The others are handled individually. The group of 4 is still
not 100% treated as a unit since some instructions modify just one of
them, for instance, but it's pretty close, optimizes for the common
case, and the odd cases can still work like they do today.
The difficulty implementing this is that exactly which condition code
bits to set and which to check for conditional microops are decided at
the microcode level and are arbitrary combinations. They don't need to
be completely arbitrary, but that means that microops really effectively
know which, how many, etc., condition code registers they need at
construction time as apposed to compile time. So what we'd need to do is
to allow the constructor for a microop to look at the flags it was being
given and to use that to more programatically figure out which registers
it had as sources or destinations, and how many. The body of the
instructions themselves would need to be sophisticated enough to pull
together the different source registers, whatever they are, and to
process them appropriately with a consistent bit of code (and not 18
different parameters to some function where 14 aren't used at any
particular time). It would also have to know how to split things back up
again when writing out the results.
What I did to move us a little bit in this direction is to make the
types of operands much more flexible so that we can have structures,
typedefs, etc. What we'd still need is truely composite operand types
where a single operand, for instance the condition code bits, is built
from a set of registers (determined in some way appropriate to the
operand) and/or written back to a set of registers, but which could be
handled easily as a single value inside the code blob. Then we can avoid
having 100(s) of versions of microops for all the different combinations
of flag bits, which would be a terrible thing to have to live with.
As far as easier ways to deal with this, there is only one which is what
I was alluding to in what I think was my earliest email, and that's to
just hack around it so the instructions you know you're using in the
performance sensitive part behave incorrectly generally speaking, but do
what you expect for the benchmark. Maybe they'd even have to know where
they were running from, that they were in a range of ticks, etc. A gross
and terrible hack unfit to check in, but something that would get the
poster unstuck for now. Doing things the "right" way will take some
infrastructure work, and that may not be very quick. I don't think
there's any real shortcut around doing the infrastructure work that
doesn't have a pretty heavy cost (like blowing up the number of microop
classes 100 fold).
Gabe
Post by Watanabe, Yasuko
Hi Gabe,
Your earlier email said "I've made some changes over time which should make
it easier to do this like a real x86 CPU would". Could you expand on that?
It sounded like you had some sort of plan or direction at least.
If we're
going to start working on this ourselves, it would be best if we can
benefit from whatever insights you've had or preliminary work you've done.
I see your later email says "I don't have any ideas for how to make it much
simpler", but that seems to contradict what you said at first. In
particular, you also earlier said "If you have an idea of how to get it to
do what you want locally, feel free. That will get you going, and when I
get it fixed for real then you can start using that." I'd like to
explicitly reject that idea... for one thing, I'm not sure what a "local"
solution would look like, and more importantly, this issue seems
complicated enough that us doing some sort of temporary or stopgap solution
like you're implying, only to throw it away once you've done it "for real",
seems like a huge waste of effort. So overall I'd like to be sure we're in
sync with whatever you're thinking to make sure that our efforts are
additive and complementary and not redundant.
Thanks,
Steve
On Fri, Apr 6, 2012 at 3:43 PM, Watanabe, Yasuko
Post by Watanabe, Yasuko
Hi Gabe,
I also went through the code and got a sense of changes that need to be
made. You are right. The current infrastructure makes it difficult to fix
this issue.
Yasuko
-----Original Message-----
Behalf Of Gabe Black
Sent: Friday, April 06, 2012 12:18 PM
Subject: Re: [gem5-dev] Data dependency caused by flags
It's complicated. Looking at it again I reminded myself of all the ways it
doesn't fit into the way the ISA parser does things, so it's going to be quite
a bit of work to fix properly. I don't have any ideas for how to make it
much simpler that would be at all practical.
Gabe
Post by Watanabe, Yasuko
Hi Gabe,
Do you already have an idea of how to fix this? If so, can you give me
some pointers?
Yasuko
-----Original Message-----
Behalf Of Gabe Black
Sent: Thursday, April 05, 2012 6:12 PM
Subject: Re: [gem5-dev] Data dependency caused by flags
Yes, you guys are right. This is a recognized problem, and I've made
some changes over time which should make it easier to do this like a real
x86 CPU would. I haven't yet, but it's on the horizon. I tend to be very
busy, although circumstances may mean I have a little more or less time
than normal for a little while so I don't know for sure when I'll get it
fixed. If you have an idea of how to get it to do what you want locally,
feel free. That will get you going, and when I get it fixed for real then
you can start using that.
Gabe
Post by Watanabe, Yasuko
Nilay,
I agree with you. I think the dependencies of those flag bits should be
evaluated at bit level.
Gabe and others,
This change seems invasive. Do you know the best way to handle this?
Yasuko
-----Original Message-----
Behalf Of Nilay Vaish
Sent: Thursday, April 05, 2012 3:35 AM
To: gem5 Developer List
Subject: Re: [gem5-dev] Data dependency caused by flags
The code for the function genFlags() in
src/arch/x86/insts/microregop.cc suggests that the values of flag bits not
updated by the ADD instruction need to be retained. This means that the
previous values need to be read and written again, which means the second
ADD can be dependent on a value written by the first ADD. If the
dependencies were evaluated at bit level, then these instructions would not
be dependent.
--
Nilay
Post by Watanabe, Yasuko
I ran O3 CPU in FS mode in x86 with a simple microbenchmark and got
a much lower IPC than the theoretical IPC. The issue seems to be
data dependencies caused by (control) flags, not registers, and I am
wondering if anyone has come across the same issue.
The microbenchmark has many data independent ADD instructions
(http://repo.gem5.org/gem5/file/570b44fe6e04/src/arch/x86/isa/insts/general_purpose/arithmetic/add_and_subtract.py#l41)
in a loop. On a 2-wide out-of-order machine with enough resources,
the IPC should be two at steady state. However, the IPC only goes
up to one. What is happening is that even though the ADDs have two
source and one destination registers and a flag to set in x86, gem5
adds one extra flag source register to the ADDs. As a result, each
ADD becomes dependent on the earlier ADD's destination flag,
constraining the achievable IPC to one.
Here is an example sequence with physical register mappings:
ADD: S1=98, S2=9, S3=2, D1=82, D2=105 (flag)
ADD: S1=92, S2=9, S3=105 (flag), D1=79, D2=90
...
Physical registers 98, 9, and 92 are ready when those two ADDs are
renamed; however, as you can see, the second ADD has to wait for the
first ADD because of the extra flag source register S3. When I
removed those flags in the macroop definition, the IPC jumped up from
1 to 1.7.
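The effect of the extra flag source can be reproduced with a toy issue model. The Python sketch below is an idealized illustration and not a model of the O3 CPU: it assumes single-cycle ADDs, unlimited other resources, and a hypothetical function name simulate. It shows only the bound that the flag chain imposes.

```python
def simulate(num_adds, width, flag_is_source):
    """Return the IPC for a run of otherwise independent single-cycle ADDs.

    Each ADD issues once its sources are ready, and at most `width` ADDs
    issue per cycle. When `flag_is_source` is true, every ADD also reads
    the flag register written by the ADD before it.
    """
    issued_in_cycle = {}   # cycle -> number of ADDs issued in that cycle
    prev_done = 0          # cycle in which the previous ADD's flag is ready
    last_cycle = 0
    for _ in range(num_adds):
        # Earliest cycle allowed by the (optional) flag dependence.
        cycle = prev_done if flag_is_source else 0
        # Respect the issue width.
        while issued_in_cycle.get(cycle, 0) >= width:
            cycle += 1
        issued_in_cycle[cycle] = issued_in_cycle.get(cycle, 0) + 1
        prev_done = cycle + 1   # single-cycle latency
        last_cycle = max(last_cycle, cycle)
    return num_adds / (last_cycle + 1)

ipc_chained = simulate(100, 2, True)    # flag read serializes the ADDs
ipc_free = simulate(100, 2, False)      # only the issue width limits IPC
```

In this model the chained case is pinned at an IPC of one regardless of the issue width, while the unchained case reaches the width of the machine.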
Does anyone know why the ADD has to read the flags, even though the
x86 manual does not say that? Those flags should just cause
write-after-write dependency, not read-after-write.
Yasuko
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Nilay Vaish
2012-04-09 15:06:03 UTC
Permalink
I wrote what I observed from the output of the following command:
grep "flags=" src/arch/x86/isa/insts/ -rn

Even with this output we can use (OF,SF,ZF,PF) as one group.
Can you elaborate on why we need those two extra bits, ECF and EZF?

--
Nilay
Post by Gabe Black
I don't think that's true, although I'm not willing to trawl through the
ISA manual to determine one way or the other. You'll need to just trust
me, since if AMD once implemented chips that way they intended to sell,
I doubt they just picked something from a hat. The group of four which
are considered together are I think ZAPS, zero, auxiliary carry (? it's
been a while), parity and sign, leaving overflow, carry, and the
artificial emulation zero and emulation carry flags. That's five groups
which is reasonable as far as having registers to rename, etc., but
still having a full cross product of combinations would be 64 variants.
I did look through the entire ISA manual's instruction listing once a
couple of years ago when I was researching this, and I think there are
only a small handful of instructions which behave badly with these
groups and, say, write to only the zero flag.
Gabe
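As a rough illustration of the grouping described above, the Python sketch below works out which flag groups a microop would write and which it would also have to read. The group names, the GROUPS table and the function flag_operands are hypothetical and only mirror the description in this message (ZAPS as one unit, with OF, CF, ECF and EZF on their own). This is not how gem5 is implemented.

```python
# Hypothetical flag groups, each standing for one renamed register.
GROUPS = {
    "zaps": frozenset({"ZF", "AF", "PF", "SF"}),
    "of": frozenset({"OF"}),
    "cf": frozenset({"CF"}),
    "ecf": frozenset({"ECF"}),
    "ezf": frozenset({"EZF"}),
}

def flag_operands(written_flags):
    """Return (dest_groups, src_groups) for a microop writing `written_flags`.

    A group is a destination if any of its flags is written. It is also a
    source only when it is written partially, because the untouched bits
    have to be carried over from the previous value.
    """
    written = frozenset(written_flags)
    dests, srcs = [], []
    for name, members in GROUPS.items():
        touched = written & members
        if touched:
            dests.append(name)
            if touched != members:
                srcs.append(name)
    return dests, srcs

add_like = flag_operands({"OF", "SF", "ZF", "AF", "PF", "CF"})
zero_only = flag_operands({"ZF"})
```

An instruction that writes all six architectural flags ends up with no flag sources at all, while one that writes only the zero flag still has to read the ZAPS group, which is the "behaves badly" case mentioned above.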
Post by Nilay Vaish
It seems to me if we can cover the following cases most of the
instructions would get covered --
1. All six condition codes -- (OF, SF, ZF, AF, PF, CF)
2. Two classes of five condition codes -- (OF,SF,ZF,PF,CF), (OF, SF,
ZF, AF, PF)
3. One class of two condition codes -- (OF,CF)
Yasuko's current problem about ADD instructions will get solved if we
just handle the first case i.e. specify that if an instruction is
writing all the condition codes, then do not assume condition code
register to be a source register.
--
Nilay
Post by Gabe Black
Yeah, I think we've talked about this topic in the past, but it was a
while ago and I don't remember exactly what all we talked about or the
conclusion(s) we reached.
The problem at the ISA level is that there are lots of instructions in
x86 which are pretty basic and used a lot (adds, subtracts, etc.) which
compute condition codes every time in case you need them. That, combined
with the fact that the instructions which update the condition codes
update somewhat erratic combinations of bits, means that lots of
instructions write the condition code bits, and those same common
instructions read them too so they can do a partial update.
This has happened to a lesser extent before where there are control-like
bits and condition-code-like bits in the same register. To my knowledge
that's happened at least on SPARC, ARM, and x86. That's dealt with by
splitting the condition code bits out into their own register, which is
treated as a renamed integer register, and the control bits which are
treated as a misc reg with all the overhead and special precautions.
That doesn't entirely work on x86, though, because even among the
condition code bits there are a lot of partial accesses as described
above. The cc bits could be broken down into individual bits, but that's
pretty cumbersome since there are, including the two artificial ones for
microcode, 8 of them I believe? That would be a lot of registers to
rename, would slow down the simulator, wouldn't be that realistic, etc.
What real CPUs do, according to someone in the know at AMD I talked to, is
gather up one group of flags, about 4 if I recall, and treat those
as a unit. The others are handled individually. The group of 4 is still
not 100% treated as a unit since some instructions modify just one of
them, for instance, but it's pretty close, optimizes for the common
case, and the odd cases can still work like they do today.
The difficulty implementing this is that exactly which condition code
bits to set and which to check for conditional microops are decided at
the microcode level and are arbitrary combinations. They don't need to
be completely arbitrary, but that means that microops really effectively
know which, how many, etc., condition code registers they need at
construction time as opposed to compile time. So what we'd need to do is
to allow the constructor for a microop to look at the flags it was being
given and to use that to more programmatically figure out which registers
it had as sources or destinations, and how many. The body of the
instructions themselves would need to be sophisticated enough to pull
together the different source registers, whatever they are, and to
process them appropriately with a consistent bit of code (and not 18
different parameters to some function where 14 aren't used at any
particular time). It would also have to know how to split things back up
again when writing out the results.
Gabriel Michael Black
2012-04-09 23:13:51 UTC
Permalink
For the same reason we need microcode registers.

Gabe
Nilay Vaish
2012-04-10 23:12:03 UTC
Permalink
Gabe, Thanks for the explanation.

I think I have a solution for the problem that will work when all the flag
bits are written. I am assuming that the condition code register is
identified as a source because it appears on the right-hand side of the
following code in src/arch/x86/isa/microops/regop.isa:

class FlagRegOp(RegOp):
    abstract = True
    flag_code = \
        "ccFlagBits = genFlags(ccFlagBits, ext, result, psrc1, op2);"

Suppose we want to remove the RAW dependence in the case of the 'add' microop.
We can create a new microop with a different mnemonic. This new microop will
be used in places where 'add' is currently used with all the flags. It will
not inherit from the FlagRegOp class as 'add' does. Instead, it will
inherit from a class that assumes a default value for ccFlagBits. We will
also need to add a default value of 0 for ccFlagBits in the prototype of
the function genFlags().

But this will only help in the case of addition and subtraction, as those are
the only ones that write all six flag bits. As Gabe pointed out, for
any good solution to work, we would need to break up the flag bits.
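Here is a small Python model of why a default value is safe when every flag is written. It is a simplified stand-in, not the genFlags() in microregop.cc: the function name gen_flags_add, the parameter order (old flags last, so that they can take a default) and the 8-bit default width are assumptions for illustration. Only the flag bit positions are the architectural EFLAGS ones.

```python
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
ALL_FLAGS = CF | PF | AF | ZF | SF | OF

def gen_flags_add(mask, src1, src2, old_flags=0, width=8):
    """Toy flag generation for an ADD of two unsigned `width`-bit values.

    Bits selected by `mask` are recomputed; all other bits are copied
    from `old_flags`.
    """
    limit = 1 << width
    sign = 1 << (width - 1)
    total = src1 + src2
    result = total & (limit - 1)
    new = 0
    if total >= limit:
        new |= CF          # carry out of the top bit
    if bin(result & 0xFF).count("1") % 2 == 0:
        new |= PF          # even parity in the low byte
    if (src1 & 0xF) + (src2 & 0xF) > 0xF:
        new |= AF          # carry out of bit 3
    if result == 0:
        new |= ZF
    if result & sign:
        new |= SF
    if (src1 & sign) == (src2 & sign) and (result & sign) != (src1 & sign):
        new |= OF          # operands share a sign that the result lacks
    return (old_flags & ~mask) | (new & mask)

full = gen_flags_add(ALL_FLAGS, 0x7F, 0x01)                        # old flags defaulted
full_old = gen_flags_add(ALL_FLAGS, 0x7F, 0x01, old_flags=ALL_FLAGS)
partial = gen_flags_add(OF | CF, 0x7F, 0x01, old_flags=ZF)         # ZF must survive
```

With the full mask the old value never reaches the result, so the microop has no real reason to name the flag register as a source. With a partial mask the old value does matter, which is why this only covers the all-flags case.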

--
Nilay
Nilay Vaish
2012-04-14 23:16:32 UTC
Permalink
The patch that was posted on the review board was along the lines of the
solution suggested in my previous mail. Note that splitting the
ccFlags register alone is not the complete solution. We would still need to
check whether all the flags in a particular register are being
written. Otherwise two 'add' instructions would still
not get done in parallel.

But Gabe is not happy with the current state of the patch, so I dug into
the isa_parser and here is another proposal. In operands.isa, let's mark
ccFlagBits so that we can identify it in isa_parser. When building the
operand map, we can then figure out that we have an operand that may be
read conditionally (similar to how the PC is read and updated). In the
makeRead() and makeConstructor functions, if we encounter such an operand,
we include a call to a check function which decides whether or not this
operand will be read. The check function will need the ext (or the flag
bits) to figure out if the operand needs to be read.
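
A minimal sketch of what such a check function might decide, written in Python for brevity. The function name, the mask constants and the "skip the operand when nothing is written" rule are assumptions made for illustration; only the bit positions come from the x86 RFLAGS layout, and none of this is existing gem5 code.

```python
# Hypothetical flag masks; bit positions follow the x86 RFLAGS layout.
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
ALL_CC_FLAGS = CF | PF | AF | ZF | SF | OF


def cc_flags_must_be_read(ext, operand_mask=ALL_CC_FLAGS):
    """Return True if the old flag value is needed as a source operand.

    ext is the set of flag bits the microop was asked to update. The old
    value is only needed when the microop writes the operand partially,
    because the bits it leaves alone have to be carried over.
    """
    written = ext & operand_mask
    if written == 0:
        # Assumption: an untouched operand is neither read nor written.
        return False
    return written != operand_mask


# An 'add' that updates all six flags does not need the old value.
print(cc_flags_must_be_read(ALL_CC_FLAGS))        # False
# An 'inc' leaves CF alone, so the old value still has to be read.
print(cc_flags_must_be_read(ALL_CC_FLAGS & ~CF))  # True
```

With a single combined register, only the full-write case loses its source operand, which is why splitting the register matters for everything else.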

How does this sound?

--
Nilay
Post by Nilay Vaish
Gabe, Thanks for the explanation.
I think I have a solution for the problem that works when all the flag
bits are written. I am assuming that the condition code register is
identified as a source because it appears on the right hand side of the
following code in src/arch/x86/isa/microops/regop.isa
abstract = True
flag_code = \
"ccFlagBits = genFlags(ccFlagBits, ext, result, psrc1, op2);"
Suppose we want to remove the RAW dependence in case of 'add' microop. We can
create a new microop with a different mnemonic. This new microop will be used
in places where currently 'add' is in use with all the flags. This microop
will not inherit from FlagRegOp class as 'add' does. It will inherit from a
class that assumes a default value for ccFlagBits. We will also need to add a
default value of 0 for ccFlagBits in the prototype of function genFlags().
But this will only help in case of addition and subtraction, as those are the
only ones that write all the six flag bits. As Gabe pointed out, for any good
solution to work, we would need to break up the flag bits.
--
Nilay
Gabe Black
2012-04-15 00:21:21 UTC
Permalink
This is more along the lines of where I want to go with this. The hard
part is going to be doing this cleanly so that we don't have a special
case hack in the parser for x86 and we don't make a big mess out of the
x86 ISA description (it's already very, uncomfortably, complicated).
This gets into what I was talking about before where the parser is smart
enough (probably with guidance and bits of code from the ISA
description) to deconstruct the way the flags are being used, to make
the right flag fragments sources and destinations, to read/write the
right pieces at the start and end of the instruction, and to assemble
and disassemble a single value to play with while in the microop's code,
all based off of construction time information, specifically which flags
are requested to be set. All of this would be quite possible or even
straightforward if doing this by hand, but then it would also be very
tedious and repetitive. So we need to figure out a way to cleanly fit it
into our code writing automation framework, aka the ISA parser, so it
can do the grubby work for us, all without making a great big mess in
the process. That's going to be a tricky bit of work, and it will have
to be done well.
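
To make the idea concrete, here is a rough Python sketch of the bookkeeping described above: deciding at construction time which flag fragments are sources and destinations, then assembling and disassembling a single value for the microop code to work with. The fragment grouping and every name in it are hypothetical, chosen only for illustration; this is not how the ISA parser is structured.

```python
# Hypothetical split of the six arithmetic flags into separately renamed
# fragments. The grouping is an assumption, not gem5's actual layout.
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
FRAGMENTS = {"cf": CF, "of": OF, "rest": PF | AF | ZF | SF}


def plan_flag_operands(ext):
    """Work out, from the requested flags, which fragments are touched."""
    sources, dests = [], []
    for name, mask in FRAGMENTS.items():
        written = ext & mask
        if written:
            dests.append(name)
            if written != mask:
                # Partial write: the untouched bits must be read and merged.
                sources.append(name)
    return sources, dests


def assemble(fragment_values):
    """Combine per-fragment values into the single value microop code sees."""
    value = 0
    for name, mask in FRAGMENTS.items():
        value |= fragment_values.get(name, 0) & mask
    return value


def disassemble(value, dests):
    """Split the microop's result back into the fragments it writes."""
    return {name: value & FRAGMENTS[name] for name in dests}


# A microop setting all six flags writes every fragment and reads none.
print(plan_flag_operands(CF | PF | AF | ZF | SF | OF))
# A microop setting only ZF partially writes 'rest', so it also reads it.
print(plan_flag_operands(ZF))
```

The tedious part Gabe mentions is that this plan has to be generated per microop and per requested flag set, which is why it belongs in the code generation step and not in hand-written microop code.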

Gabe
Steve Reinhardt
2012-04-08 05:46:03 UTC
Permalink
Hi Gabe,

Thanks for all the info. For the most part this makes sense, but there's
one part I don't quite follow:

Post by Gabe Black
The difficulty implementing this is that exactly which condition code
bits to set and which to check for conditional microops are decided at
the microcode level and are arbitrary combinations.
I'm not quite sure what you mean by "conditional microops" here... can you
give a concrete example?

Thanks,

Steve
Gabe Black
2012-04-08 08:34:57 UTC
Permalink
Yeah, sure. Let's say there's a conditional jump, and it needs to write
the IP using the wrip microop. The exact condition you're testing will
depend on *which* conditional jump it is, but they're all implemented
using the same wrip microop, which tests the condition codes and either
writes a new IP value or doesn't. The set of conditions is different
from the set of condition codes. If you want to look at how they map,
check out the checkCondition function in src/arch/x86/insts/microop.cc.
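
As an illustration of how conditions differ from condition codes, the Python sketch below maps a few standard x86 conditions (the suffixes of JE, JB, JBE, JL and JLE) to the flags they read. The table and helper names are hypothetical and are not taken from checkCondition; only the flag semantics of those five conditions come from the x86 architecture.

```python
CF, ZF, SF, OF = 1 << 0, 1 << 6, 1 << 7, 1 << 11

# condition name -> (mask of flags it reads, predicate over the flag value)
CONDITIONS = {
    "e":  (ZF, lambda f: bool(f & ZF)),
    "b":  (CF, lambda f: bool(f & CF)),
    "be": (CF | ZF, lambda f: bool(f & (CF | ZF))),
    "l":  (SF | OF, lambda f: bool(f & SF) != bool(f & OF)),
    "le": (ZF | SF | OF,
           lambda f: bool(f & ZF) or (bool(f & SF) != bool(f & OF))),
}


def flags_read(cond):
    """Flags a conditional microop would need as sources for this condition."""
    return CONDITIONS[cond][0]


def check_condition(cond, flags):
    """Evaluate the condition against a flag value."""
    return CONDITIONS[cond][1](flags)


print(check_condition("be", ZF))       # True: below-or-equal holds if ZF set
print(check_condition("l", SF | OF))   # False: SF == OF means not less
```

Because a single microop such as wrip can be built with any of these conditions, the set of flag bits it reads is only known once the condition is fixed at construction time.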

Gabe
Ali Saidi
2012-04-07 19:10:50 UTC
Permalink
I can't say I have a clue about the x86 condition codes; however, for ARM we successfully split up the condition codes into groups that were sticky and groups that were not, and finally into groups of the subgroups that were written together. In doing so we got the O3 CPU to only insert dependencies between instructions where there are real flag dependencies.

Thanks,
Ali

Sent from my ARM powered mobile device
Post by Gabe Black
It's complicated. Looking at it again I reminded myself of all the ways
it doesn't fit into the way the ISA parser does things, so it's going to
be quite a bit of work to fix properly. I don't have any ideas for how to
make it much simpler that would be at all practical.
Gabe
Post by Watanabe, Yasuko
Hi Gabe,
Do you already have an idea of how to fix this? If so, can you give me some pointers?
Yasuko
-----Original Message-----
Sent: Thursday, April 05, 2012 6:12 PM
Subject: Re: [gem5-dev] Data dependency caused by flags
Yes, you guys are right. This is a recognized problem, and I've made some changes over time which should make it easier to do this like a real x86 CPU would. I haven't yet, but it's on the horizon. I tend to be very busy, although circumstances may mean I have a little more or less time than normal for a little while so I don't know for sure when I'll get it fixed. If you have an idea of how to get it to do what you want locally, feel free. That will get you going, and when I get it fixed for real then you can start using that.
Gabe
Post by Watanabe, Yasuko
Nilay,
I agree with you. I think the dependencies of those flag bits should be evaluated at bit level.
Gabe and others,
This change seems invasive. Do you know the best way to handle this?
Yasuko
-----Original Message-----
Behalf Of Nilay Vaish
Sent: Thursday, April 05, 2012 3:35 AM
To: gem5 Developer List
Subject: Re: [gem5-dev] Data dependency caused by flags
The code for the function genFlags() in src/arch/x86/insts/microregop.cc suggests that the values of flag bits not updated by the ADD instruction need to be retained. This means that the previous values need to be read and written again, which means the second ADD can be dependent on a value written by the first ADD. If the dependencies were evaluated at bit level, then these instructions would not be dependent.
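
The read-modify-write pattern described here can be sketched in a few lines of Python. This is a simplified stand-in, not the actual genFlags() code: the function name and its three arguments are assumptions, and the real function computes the new flag values itself instead of receiving them.

```python
CF, PF, AF, ZF, SF, OF = 1 << 0, 1 << 2, 1 << 4, 1 << 6, 1 << 7, 1 << 11
ALL_CC_FLAGS = CF | PF | AF | ZF | SF | OF


def gen_flags_sketch(old_flags, flag_mask, computed_flags):
    """Keep the bits outside flag_mask, replace the bits inside it."""
    return (old_flags & ~flag_mask) | (computed_flags & flag_mask)


# Updating only ZF preserves an earlier CF, so the old value really is read.
print(gen_flags_sketch(CF, ZF, ZF) == (CF | ZF))  # True

# When every flag is updated, the result no longer depends on the old value,
# even though the old value is still passed in as an input.
new = ZF | PF
print(gen_flags_sketch(0, ALL_CC_FLAGS, new)
      == gen_flags_sketch(ALL_CC_FLAGS, ALL_CC_FLAGS, new))  # True
```

The second case is the one the ADD microbenchmark hits: the old flags appear as a source operand, but none of their bits survive into the result.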
--
Nilay
Post by Watanabe, Yasuko
I ran O3 CPU in FS mode in x86 with a simple microbenchmark and got a
much lower IPC than the theoretical IPC. The issue seems to be data
dependencies caused by (control) flags, not registers, and I am
wondering if anyone has come across the same issue.
The microbenchmark has many data independent ADD instructions
(http://repo.gem5.org/gem5/file/570b44fe6e04/src/arch/x86/isa/insts/general_purpose/arithmetic/add_and_subtract.py#l41)
in a loop. On a 2-wide out-of-order machine with enough resources,
the IPC should be two at a steady state. However, the IPC only goes
up to one. What is happening is that even though the ADDs have two
source and one destination registers and a flag to set in x86, gem5
adds one extra flag source register to the ADDs. As a result, each
ADD becomes dependent on the earlier ADD's destination flag,
constraining the achievable IPC to one.
ADD: S1=98, S2=9, S3=2, D1=82, D2=105 (flag)
ADD: S1=92, S2=9, S3=105 (flag), D1=79, D2=90 ...
Physical registers 98, 9, and 92 are ready when those two ADDs are
renamed; however, as you can see, the second ADD has to wait for the
first ADD because of the extra flag source register S3. When I
removed those flags in the macroop definition, the IPC jumped up from 1 to 1.7.
Does anyone know why the ADD has to read the flags, even though the
x86 manual does not say that? Those flags should just cause
write-after-write dependency, not read-after-write.
Yasuko
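
The IPC ceiling described in this message can be reproduced with a toy Python model. It assumes an idealised machine: single-cycle ADDs, unlimited registers and no front-end limits. The function and parameter names are hypothetical, and the model only illustrates the dependency chain; it is not meant to reproduce gem5's O3 pipeline or the measured 1.7 figure.

```python
def simulate_ipc(num_adds, width, chain_through_flags):
    """Issue single-cycle ADDs on an idealised `width`-wide machine."""
    ready_cycle = []       # cycle in which each ADD's results are available
    issued_in_cycle = {}   # cycle -> number of ADDs issued in that cycle
    for i in range(num_adds):
        earliest = 0
        if chain_through_flags and i > 0:
            # The extra flag source makes this ADD wait for the previous one.
            earliest = ready_cycle[i - 1]
        cycle = earliest
        while issued_in_cycle.get(cycle, 0) >= width:
            cycle += 1
        issued_in_cycle[cycle] = issued_in_cycle.get(cycle, 0) + 1
        ready_cycle.append(cycle + 1)
    return num_adds / max(ready_cycle)


print(simulate_ipc(1000, 2, chain_through_flags=True))   # 1.0
print(simulate_ipc(1000, 2, chain_through_flags=False))  # 2.0
```

With the flag read in place, the ADDs form a serial chain and the issue width is irrelevant; without it, the model reaches the theoretical IPC of two.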
Gabe Black
2012-04-08 00:58:44 UTC
Permalink
The x86 condition codes are a little different in that lots of
instructions update the flags just because that's what they do, and the
flags they update are a bit erratic, more so I think than ARM. I think
the solution will ultimately be along the lines of what ARM, and
similarly x86, does now, just more of it, which will require a bit more
sophistication so that it scales well.

Gabe