Discussion:
[gem5-dev] Failed SPARC test
Steve Reinhardt
2011-10-24 00:18:03 UTC
This makes sense, since the time the regression started failing is
consistent with when gcc was upgraded on zizzer.

I see there is a gcc-4.4 package available for ubuntu 11.04 (which zizzer is
running)... is there more to it than installing that package and recompiling
to get a workable binary to run tracediff with?
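
A minimal sketch of what that would involve, assuming the stock Ubuntu
gcc-4.4/g++-4.4 packages and the usual scons regression target; whether
CC/CXX are honored on the scons command line or have to be set in the
environment is an assumption here:

    sudo apt-get install gcc-4.4 g++-4.4
    # build with the older compiler and run just the failing regression
    scons CC=/usr/bin/gcc-4.4 CXX=/usr/bin/g++-4.4 \
        build/SPARC_FS/tests/opt/long/80.solaris-boot/sparc/solaris/t1000-simple-atomic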

I'd try myself but I've forgotten my zizzer password (again!) so I can't
sudo. It's tough when you've had the same password for ten years then you
change it but don't use the new one much...

Steve

On Sun, Sep 25, 2011 at 1:14 PM, Ali Saidi wrote:

> Yes, what Gabe said. With gcc 4.5 (the version zizzer now runs) I cannot find
> a version of the repository that passes sparc boot. I'm pretty sure it's an
> annoying compiler issue, but there are some annoyances in figuring out where
> to look, as Gabe points out. If your stats changes work on everything else,
> I'm happy to see them committed while this issue goes on in the background.
>
> Thanks,
>
> Ali
>
> Sent from my ARM powered device
>
> On Sep 25, 2011, at 3:06 PM, Gabe Black wrote:
>
> > We (Ali and I) have each looked at that before, and we think it depends
> > on the compiler version. Something changes when you build with a new enough
> > gcc, and the behavior of the SPARC model changes. I think the new behavior is
> > broken and the old behavior is correct, but I'd have to look at it
> > again. I haven't looked into it further than that yet because I'd want
> > to tracediff between versions built with different compilers. Since they
> > would need to find different versions of libraries and can't just run
> > from the same command line, it's logistically annoying.
> >
> > Gabe
> >
> > On 09/25/11 09:52, nathan binkert wrote:
> >> I'm trying to get my python stats changes into the tree, but it
> >> appears that one of the regression tests no longer works (zizzer
> >> agrees with me):
> >>
> >> SPARC_FS/tests/opt/long/80.solaris-boot/sparc/solaris/t1000-simple-atomic
> >>
> >> Gabe, I think you're the only one that's been messing with SPARC. Can
> >> you take a look?
> >>
> >> Nate
Ali Saidi
2011-10-24 02:50:09 UTC
I've installed it.

Ali

Steve Reinhardt
2011-10-24 16:19:05 UTC
Great, thanks a lot. I was able to build with
'CC=/usr/bin/gcc-4.4 CXX=/usr/bin/g++-4.4' and get a binary that passes this
test on the head, so it's definitely the compiler. I also ran tracediff and
it looks like it's an off-by-one thing with %fp; here's the first error:

-931697720: system.cpu T0 : 0xff1aa5b8 : stdf %fp, [%f29 + -0x20] :
MemWrite : D=0x423000000000197a A=0xfeffa280
+931697720: system.cpu T0 : 0xff1aa5b8 : stdf %fp, [%f29 + -0x20] :
MemWrite : D=0x4230000000001979 A=0xfeffa280

(The good gcc-4.4 version is second, so the '1979' is the correct value
here.)
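
Roughly what that comparison looks like, as a sketch only: the build tree
names and the config placeholder are hypothetical, and the util/tracediff
convention assumed here (alternative binaries separated by '|', everything
else passed to both runs) should be checked against the script itself.

    # hypothetical build trees, one per compiler; extra simulator options
    # (like the --debug-flag/--trace-start ones used in this thread) are
    # assumed to be passed through to both runs
    util/tracediff "build-gcc44/SPARC_FS/m5.opt|build-gcc45/SPARC_FS/m5.opt" \
        <the regression's config script and options>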

I ran one more tracediff with '--debug-flag=All --trace-start=931600000' to
see if anything else turns up sooner, and got this:

@@ -1380553 +1380553 @@
931697014: system.cpu.[tid:0]: Reading float reg 3 (3) bits as 0, 0.
931697014: system.cpu.[tid:0]: Reading float reg 2 (2) bits as 0x3e300000,
0.171875.
931697014: global: FSR read as: 0xc0000000
-931697014: system.cpu.[tid:0]: Setting float reg 12 (12) bits to 0, 0.
+931697014: system.cpu.[tid:0]: Setting float reg 12 (12) bits to
0x80000000, -0.
931697014: system.cpu.[tid:0]: Setting float reg 13 (13) bits to 0, 0.
931697014: global: FSR written with: 0xc0000000
931697014: system.cpu + A16 T0 : 0xff1aa434 : fsubd
%f31,%f30,%f12 : FloatAdd : D=0x00000000c0000000
@@ -1380951 +1380951 @@
931697038: system.cpu.[tid:0]: Reading float reg 5 (5) bits as 0, 0.
931697038: system.cpu.[tid:0]: Reading float reg 4 (4) bits as 0, 0.
931697038: system.cpu.[tid:0]: Reading float reg 13 (13) bits as 0, 0.
-931697038: system.cpu.[tid:0]: Reading float reg 12 (12) bits as 0, 0.
+931697038: system.cpu.[tid:0]: Reading float reg 12 (12) bits as
0x80000000, -0.
931697038: global: FSR read as: 0xc0000000
931697038: system.cpu.[tid:0]: Setting float reg 18 (18) bits to 0, 0.
931697038: system.cpu.[tid:0]: Setting float reg 19 (19) bits to 0, 0.
@@ -1381022 +1381022 @@
931697042: system.cpu.[tid:0]: Reading float reg 10 (10) bits as
0x41300000, 11.
931697042: global: FSR read as: 0xc0000000
931697042: system.cpu.[tid:0]: Setting float reg 16 (16) bits to
0x41300000, 11.
-931697042: system.cpu.[tid:0]: Setting float reg 17 (17) bits to 0xe685,
8.26948e-41.
+931697042: system.cpu.[tid:0]: Setting float reg 17 (17) bits to 0xe684,
8.26934e-41.
931697042: global: FSR written with: 0xc0000000
931697042: system.cpu + A16 T0 : 0xff1aa4a4 : faddd %f3,%f2,%f16
: FloatAdd : D=0x00000000c0000000
931697042: Event_18: AtomicSimpleCPU tick event scheduled @ 931697043

Could it be some kind of FP rounding error? It's not clear how that would
end up affecting %fp though. (Actually, looking at this a little closer,
are we even disassembling that correctly? Seems to me it should be 'stdf
%f29, [%fp + -0x20]'.)

I won't have time to look into this further anytime soon, but I hope this
will give someone else (Gabe?) enough to go on to get this figured out.

Thanks,

Steve


Gabe Black
2011-10-25 07:17:05 UTC
An FP rounding error seems very plausible, but I'm not sure how +/- zero
would make any difference. I'm skeptical that our FP implementation in
SPARC is accurate enough to care much about such a small difference,
although it is, of course, entirely possible it cascades from there into
a larger difference which breaks things.

I've gone back and improved the SPARC disassembly in the past, but it's
still not perfect. The problem is that the decoder hierarchy that works for
implementing the instructions doesn't necessarily mirror the one you need for
accurate disassembly. I also think I keyed disassembly off operand position
(src 0 is for this, dest 0 is for that), and that doesn't always work very
well. That's probably what's going wrong here.

Is there a point after this where things diverge significantly? This
could be just a blip of noise, with the real problem happening a lot later.
It's a *major* pain in the butt to write code that theoretically handles
all the weird little FP corner cases and gets all the bits right when the
host ISA has different rules for FP than the guest, and it's even harder to
actually get the compiler to generate that code without moving things
around and messing it all up. And glibc's FP support is wrong sometimes!
What fun. I mostly suspect the real problem is farther on, and I'm also
partly holding out hope that we don't have to wade into the FP soup.

Gabe

Steve Reinhardt
2011-10-25 07:32:28 UTC
Hard to tell... there are larger and larger differences after that point
that seem to be cascading from this one, but it takes a while before they
diverge completely. I put the trace in /tmp/tracediff-8625.out on zizzer if
you want to take a look for yourself.

It seems odd that the solaris boot would be doing that much FP in any case,
but there does seem to be quite a bit of it.

Steve


Gabe Black
2011-10-25 09:30:54 UTC
Ah, ok, I was just being dumb. All the stdf-s and lddf-s are just moving
memory around, I think. That way you can load/store 64 bits at a time
and get it done with fewer instructions. I think those instructions
themselves can be ignored. I'm also surprised that there would be much
floating point.

I'm currently building binutils for SPARC, so hopefully I can
disassemble some things and get a better idea of what's going on. It's
probably going to be really annoying to figure it out.
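
A rough sketch of that kind of cross-binutils setup; the target triple, the
install prefix, and the raw-bytes objdump options are assumptions rather
than anything settled in this thread:

    # from a binutils source tree
    ./configure --target=sparc64-linux-gnu --prefix=$HOME/binutils-sparc
    make && make install
    # disassemble an ELF object (hypothetical file name)...
    $HOME/binutils-sparc/bin/sparc64-linux-gnu-objdump -d some_module.o | less
    # ...or raw SPARC V9 bytes dumped from simulated memory
    $HOME/binutils-sparc/bin/sparc64-linux-gnu-objdump -b binary -m sparc:v9 -D mem.bin | less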

Gabe

Gabe Black
2011-10-25 10:04:03 UTC
Do you know where the solaris kernel actually is on that disk image? I
can't disassemble it if I don't know which file it is :-P. Ali?

Gabe

Ali Saidi
2011-10-25 12:24:45 UTC
It's not a single file; it's a bunch of objects in a directory hierarchy. Think kernel modules for nearly everything.

Ali

Sent from my ARM powered mobile device

Gabe Black
2011-10-25 18:38:30 UTC
Permalink
I think that means things are dynamically relocated and it's basically
impossible to associate running code with a binary file and/or source?
That's unfortunate :-/.

Gabe

On 10/25/11 05:24, Ali Saidi wrote:
> It's not a file, it's a bunch of objects in a directory hierarchy. Think kernel modules for nearly everything.
>
> Ali
>
> Sent from my ARM powered mobile device
>
> On Oct 25, 2011, at 5:04 AM, Gabe Black <gblack-***@public.gmane.org> wrote:
>
>> Do you know where the solaris kernel actually is on that disk image? I
>> can't disassemble it if I don't know which file it is :-P. Ali?
>>
>> Gabe
Steve Reinhardt
2011-10-25 14:46:04 UTC
Permalink
On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <gblack-***@public.gmane.org> wrote:

> Ah, ok, I was just being dumb. All the stdf-s and lddf-s are just moving
> memory around, I think. That way you can load/store 64 bits at a time
> and get it done with fewer instructions. I think those instructions
> themselves can be ignored.


If what you mean is that the actual problem is induced in an FP operation
and not in the stdf/lddf itself, then yes, it looks like you're right. Note
that in the detailed tracediff below, the original divergence is on the
result of an fsubd. I think there are quite a few FP ops that are giving
slightly different results before one shows up in the exec trace, and the
reason appears to be that the data field output on FP op exec tracing is
broken... maybe we're only properly reading one register from the register
pair? So I think the only reason the error first shows up in a stdf in the
exec trace is because that's the first instruction where the trace output
isn't broken.

I created a /tmp/sparc-error directory on zizzer, moved the original
tracediff in there, and also copied two new files: pre-error-trace.out and
detailed-tracediff.out. Hope the names are self-explanatory. Now you have
access to all the traces I generated.



> I'm also surprised that there would be much
> floating point.
>

Yea, and it's really weird stuff too... almost like they're running tests on
the FPU or something:

931697674: system.cpu T0 : 0xff1aa4b0 : faddd %f21,%f20,%f20 :
FloatAdd : D=0x00000000c0000000
931697675: system.cpu T0 : 0xff1aa4b4 : fsubd %f17,%f16,%f28 :
FloatAdd : D=0x00000000c0000000
931697676: system.cpu T0 : 0xff1aa4b8 : faddd %f19,%f18,%f4 :
FloatAdd : D=0x00000000c0000000
931697677: system.cpu T0 : 0xff1aa4bc : fsubd %f3,%f2,%f0 :
FloatAdd : D=0x00000000c0000000
931697678: system.cpu T0 : 0xff1aa4c0 : faddd %f7,%f6,%f14 :
FloatAdd : D=0x00000000c0000000
931697679: system.cpu T0 : 0xff1aa4c4 : fsubd %f5,%f4,%f30 :
FloatAdd : D=0x00000000c0000000
931697680: system.cpu T0 : 0xff1aa4c8 : faddd %f11,%f10,%f6 :
FloatAdd : D=0x00000000c0000000
931697681: system.cpu T0 : 0xff1aa4cc : fcmpd %f21,%f20,%fsr :
FloatAdd : D=0x00000000c0000000
931697682: system.cpu T0 : 0xff1aa4d0 : faddd %f7,%f6,%f18 :
FloatAdd : D=0x00000000c0000000

Note also how the data field in the trace output is always the same, even
though the detailed tracediff shows that these instructions aren't always
producing the same values.

>
> I'm currently building binutils for SPARC, so hopefully I can
> disassemble some things and get a better idea of what's going on. It's
> probably going to be really annoying to figure it out.


If it's really just an FP rounding error, it might not be that hard... just
look at the examples from the trace of where it's going wrong, figure out
what the right answer is, and focus on those few instructions. FP is pretty
thoroughly specified by IEEE, so if it's not an outright compiler bug, maybe
it's just some change in the default rounding settings or something.

Even if the FP rounding error isn't the source of the problem, it might be
easiest to fix that and get it out of the way so we can see what the actual
problem is.

If you really want to know *why* the kernel is doing all this FP, then yes,
you probably need to look at the source code.

Steve
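
What "a change in the default rounding settings" means at the host level is the C99/C++11 <cfenv> rounding mode. A minimal standalone sketch (illustrative only, not gem5 code) of how that mode changes an FP result:

#include <cfenv>
#include <cstdio>

int main()
{
    // volatile keeps the compiler from folding the divide at compile
    // time, which would ignore the runtime rounding mode entirely.
    volatile double num = 1.0, den = 3.0;

    std::fesetround(FE_DOWNWARD);
    double lo = num / den;            // rounded toward -infinity

    std::fesetround(FE_UPWARD);
    double hi = num / den;            // rounded toward +infinity

    std::fesetround(FE_TONEAREST);    // restore the default

    // The two quotients differ by one ULP.
    std::printf("%.20g\n%.20g\n", lo, hi);
    return 0;
}

Strictly speaking, ISO C says code that plays with the FP environment like this should be compiled under #pragma STDC FENV_ACCESS ON, and gcc doesn't implement that pragma, which is the same gap the rest of this thread keeps running into.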
Gabe Black
2011-10-25 18:53:29 UTC
Permalink
On 10/25/11 07:46, Steve Reinhardt wrote:
> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <gblack-***@public.gmane.org> wrote:
>
>> Ah, ok, I was just being dumb. All the stdf-s and lddf-s are just moving
>> memory around, I think. That way you can load/store 64 bits at a time
>> and get it done with fewer instructions. I think those instructions
>> themselves can be ignored.
>
> If what you mean is that the actual problem is induced in an FP operation
> and not in the stdf/lddf itself, then yes, it looks like you're right. Note
> that in the detailed tracediff below, the original divergence is on the
> result of an fsubd. I think there are quite a few FP ops that are giving
> slightly different results before one shows up in the exec trace, and the
> reason appears to be that the data field output on FP op exec tracing is
> broken... maybe we're only properly reading one register from the register
> pair? So I think the only reason the error first shows up in a stdf in the
> exec trace is because that's the first instruction where the trace output
> isn't broken.

It's not that it's broken, it's that it sets more than one register. One
or two will be the fp result, and one is for the FP condition codes.
Registers are ordered like they are so the disassembly can figure out
(sort of) which registers to use for which purpose, and as a result the
condition codes tend to be the thing picked for both integer and FP
instructions. That's frequently not very useful, but even if it was an
FP dest reg it wouldn't be both of them for double precision
instructions. If you want the real story of how registers are being
read/written, use the "Registers" trace flag (if it isn't called exactly
that, it's something similar). That will print out which registers are
being accessed, how, and what value is being passed around.


> I created a /tmp/sparc-error directory on zizzer, moved the original
> tracediff in there, and also copied two new files: pre-error-trace.out and
> detailed-tracediff.out. Hope the names are self-explanatory. Now you have
> access to all the traces I generated.

Ok, thanks.

>
>
>> I'm also surprised that there would be much
>> floating point.
>>
> Yea, and it's really weird stuff too... almost like they're running tests on
> the FPU or something:
>
> 931697674: system.cpu T0 : 0xff1aa4b0 : faddd %f21,%f20,%f20 :
> FloatAdd : D=0x00000000c0000000
> 931697675: system.cpu T0 : 0xff1aa4b4 : fsubd %f17,%f16,%f28 :
> FloatAdd : D=0x00000000c0000000
> 931697676: system.cpu T0 : 0xff1aa4b8 : faddd %f19,%f18,%f4 :
> FloatAdd : D=0x00000000c0000000
> 931697677: system.cpu T0 : 0xff1aa4bc : fsubd %f3,%f2,%f0 :
> FloatAdd : D=0x00000000c0000000
> 931697678: system.cpu T0 : 0xff1aa4c0 : faddd %f7,%f6,%f14 :
> FloatAdd : D=0x00000000c0000000
> 931697679: system.cpu T0 : 0xff1aa4c4 : fsubd %f5,%f4,%f30 :
> FloatAdd : D=0x00000000c0000000
> 931697680: system.cpu T0 : 0xff1aa4c8 : faddd %f11,%f10,%f6 :
> FloatAdd : D=0x00000000c0000000
> 931697681: system.cpu T0 : 0xff1aa4cc : fcmpd %f21,%f20,%fsr :
> FloatAdd : D=0x00000000c0000000
> 931697682: system.cpu T0 : 0xff1aa4d0 : faddd %f7,%f6,%f18 :
> FloatAdd : D=0x00000000c0000000
>
> Note also how the data field in the trace output is always the same, even
> though the detailed tracediff shows that these instructions aren't always
> producing the same values.

I think they're not setting any FP condition codes differently, really.
It actually could be some sort of boot time self test now that you
mention it.

>> I'm currently building binutils for SPARC, so hopefully I can
>> disassemble some things and get a better idea of what's going on. It's
>> probably going to be really annoying to figure it out.
>
> If it's really just an FP rounding error, it might not be that hard... just
> look at the examples from the trace of where it's going wrong, figure out
> what the right answer is, and focus on those few instructions. FP is pretty
> thoroughly specified by IEEE, so if it's not an outright compiler bug, maybe
> it's just some change in the default rounding settings or something.

Yeah, I think ISAs treat IEEE as a really good suggestion rather than a
standard. ARM isn't strictly conformant, and neither is x86. The default
rounding mode *is* standard, though, and I don't think it is adjusted in
SPARC as a result of execution. If it changed somehow (unless I'm
forgetting where SPARC does that) it's a fairly significant problem.
Whether instructions generate +/- 0 in various situations may depend on,
for instance, what order gcc decides to put the operands in. I'm not sure
that it does, but there are all kinds of weird, subtle behaviors with
FP, and you can't just fix how add works if x86 picked the wrong thing.
Then you have to replace add, or semi-replace it by faking it out with
other FP operations. If we're running real x87 instructions (we
shouldn't be in 64 bit mode, but we still could) then those use 80 bit
operands internally. Where and when rounding takes place depends on when
those are moved in/out of the FPU, and will be different than true 64
bit operands. SSE based FP uses real 64 bit doubles, so that should
behave better. It should also be the default in 64 bit mode since the
compiler can assume some basic SSE support is present.
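
The fsubd at 931697014 in the tracediff produces an exactly-zero result in both runs, and the two builds disagree only on its sign; the faddd at 931697042 differs by a single ULP. Both are classic signatures of the two runs computing under different rounding directions: IEEE 754 makes an exact-zero difference +0 under every rounding direction except round-toward-negative, where it is -0. A standalone illustration (plain <cfenv>, not gem5 code; 0.171875 is just a value borrowed from the trace):

#include <cfenv>
#include <cmath>
#include <cstdio>

int main()
{
    // Equal operands, so the difference is exactly zero; only its sign
    // depends on the current rounding direction.
    volatile double a = 0.171875, b = 0.171875;

    std::fesetround(FE_TONEAREST);
    std::printf("nearest:  signbit=%d\n", (int)std::signbit(a - b)); // 0 -> +0

    std::fesetround(FE_DOWNWARD);
    std::printf("downward: signbit=%d\n", (int)std::signbit(a - b)); // 1 -> -0

    std::fesetround(FE_TONEAREST);
    return 0;
}

For what it's worth, the FSR in that window reads as 0xc0000000, i.e. FSR<31:30> == 3 (the basic.isa excerpt further down the thread maps that to M5_FE_DOWNWARD), so the guest really was asking for a non-default rounding direction right where the two builds diverge.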

Ali Saidi
2011-10-26 05:28:53 UTC
Permalink
On Tue, 25 Oct 2011 11:53:29 -0700, Gabe Black <gblack-***@public.gmane.org>
wrote:
> Yeah, I think ISAs treat IEEE as a really good suggestion rather than
> a
> standard. ARM isn't strictly conformant, and neither is x86. The
> default
> rounding mode *is* standard, though, and I don't think is adjusted in
> SPARC as a result of execution. If it changed somehow (unless I'm
> forgetting where SPARC does that) it's a fairly significant problem.
> Whether instructions generate +/- 0 in various situations may depend
> on,
> for instance, what order gcc decides to put the operands. I'm not
> sure
> that it does, but there are all kinds of weird, subtle behaviors with
> FP, and you can't just fix how add works if x86 picked the wrong
> thing.
> Then you have to replace add, or semi-replace it by faking it out
> with
> other FP operations. If we're running real x87 instructions (we
> shouldn't be in 64 bit mode, but we still could) then those use 80
> bit
> operands internally. Where and when rounding takes place depends on
> when
> those are moved in/out of the FPU, and will be different than true 64
> bit operands. SSE based FP uses real 64 bit doubles, so that should
> behave better. It should also be the default in 64 bit mode since the
> compiler can assume some basic SSE support is present.

The rounding mode in SPARC is controlled by bits 31:30 of the FSR. My
guess is that this is actually the problem and gcc 4.5+ is doing some
code motion that is moving the actual fp code around our setting of the
rounding mode. Using one of the asm tricks to prevent code movement
(an empty asm() is supposed to act as a code barrier in gcc) might fix
the problem. I don't have time to try it, but
src/arch/sparc/isa/formats/basic.isa:145 looks like the right place.
Also, trying to run the regression with m5.debug would show whether the
optimizer is at fault.

Ali
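
For reference, a sketch of what Ali is proposing here, pulled out of the ISA-description machinery so it stands alone (illustrative only; the FSR<31:30> mapping follows the basic.isa excerpt Gabe posts below, and the helper name is made up):

#include <cfenv>

// gcc's conventional compiler-level barrier: the asm emits no
// instructions, but the volatile plus the "memory" clobber means gcc
// must assume it may read or write any memory, so memory accesses
// cannot be moved across it.
#define COMPILER_BARRIER() __asm__ __volatile__ ("" ::: "memory")

static double emulatedFaddd(double a, double b, int fsrRoundField)
{
    int newrnd = FE_TONEAREST;
    switch (fsrRoundField) {          // FSR<31:30>
      case 0: newrnd = FE_TONEAREST;  break;
      case 1: newrnd = FE_TOWARDZERO; break;
      case 2: newrnd = FE_UPWARD;     break;
      case 3: newrnd = FE_DOWNWARD;   break;
    }

    COMPILER_BARRIER();
    int oldrnd = std::fegetround();
    COMPILER_BARRIER();
    std::fesetround(newrnd);
    COMPILER_BARRIER();

    double result = a + b;            // the emulated FP operation

    COMPILER_BARRIER();
    std::fesetround(oldrnd);
    COMPILER_BARRIER();

    return result;
}

The caveat is that a "memory" clobber only pins memory accesses; an FP temporary that lives entirely in registers can still be scheduled past the barrier, which may be why the barriers alone turn out not to be enough further down the thread.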
Gabe Black
2011-10-26 07:55:40 UTC
Permalink
On 10/25/11 22:28, Ali Saidi wrote:
> The rounding mode in SPARC is controlled by bits 31:30 of the FSR. My
> guess is that this is actually the problem and gcc 4.5+ is doing some
> code motion that is moving the actual fp code around our setting of
> the rounding mode. Using one of the asm tricks to prevent code
> movement (an empty asm() is supposed to act as a code barrier in gcc)
> might fix the problem. I don't have time to try it, but
> src/arch/sparc/isa/formats/basic.isa:145 looks like the right place.
> Also, trying to run the regression with m5.debug would show whether
> the optimizer is at fault.
>
> Ali

Ah, ok, so we do set the mode apparently. I'll try gem5.debug and also
look at that template and see what I can see. Thanks Ali!

Gabe
Steve Reinhardt
2011-10-26 14:10:44 UTC
Permalink
I forgot to mention that I fired off a gem5.debug run before I went to bed
last night, and it completed successfully. So it does appear to be the
optimizer.

Steve

Gabe Black
2011-10-27 07:35:28 UTC
Permalink
I'm convinced we've successfully identified the problem, but
unfortunately I added barriers liberally and it still failed.

Gabe

int newrnd = M5_FE_TONEAREST;
switch (Fsr<31:30>) {
case 0: newrnd = M5_FE_TONEAREST; break;
case 1: newrnd = M5_FE_TOWARDZERO; break;
case 2: newrnd = M5_FE_UPWARD; break;
case 3: newrnd = M5_FE_DOWNWARD; break;
}
__asm__ __volatile__ ("" ::: "memory");
int oldrnd = m5_fegetround();
__asm__ __volatile__ ("" ::: "memory");
m5_fesetround(newrnd);
__asm__ __volatile__ ("" ::: "memory");
"""

fp_code += code


fp_code += """
__asm__ __volatile__ ("" ::: "memory");
m5_fesetround(oldrnd);
__asm__ __volatile__ ("" ::: "memory");
"""
fp_code = filterDoubles(fp_code)
iop = InstObjParams(name, Name, 'SparcStaticInst', fp_code, flags)
header_output = BasicDeclare.subst(iop)
decoder_output = BasicConstructor.subst(iop)
decode_block = BasicDecode.subst(iop)
exec_output = BasicExecute.subst(iop)
}};


Steve Reinhardt
2011-10-27 15:32:10 UTC
Permalink
Are you positive this is it? It does sound very likely that this is the
issue, but is there indisputable evidence, like you looked at the
disassembly and you can see that things are scheduled in the wrong order?
I'm asking because even though I agree that this seems likely to be the
issue, it seems equally unlikely that gcc would reorder operations around
function calls like m5_fesetround() (unless they're inlined), and the fact
that the asm statements didn't help seems like further evidence that maybe
we're not focusing on exactly the right place.

Steve

Gabe Black
2011-10-27 18:30:38 UTC
Permalink
Exactly this was happening on ARM, and that's why there are all the
weird __asm__ statements in its instructions. There I had to also
specify variables as inputs and outputs from the __asm__ statements so
that other instructions would have to produce their value before or
consume their value after that point. Here I can't do that since (quite
reasonably) the code that sets the rounding mode is factored out into a
common blob, and I don't know what the necessary variables are. There's
supposed to be some way to prevent this sort of problem, but for gcc
it's not implemented. I forget exactly how that's supposed to work.

Gabe
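
Concretely, the ARM-style fix described above looks something like the following (a hedged sketch, not the actual ARM ISA code): naming the operands and the result as outputs of otherwise-empty asm statements makes gcc treat each asm as a definition/use of those values, so the FP operation can be neither hoisted above the fesetround() call nor sunk below the restore. "+m" is the conservative constraint choice; it forces the value through memory.

#include <cfenv>

static double addWithGuestRounding(double a, double b, int newrnd)
{
    int oldrnd = std::fegetround();
    std::fesetround(newrnd);

    // As far as gcc knows, the inputs are (re)defined here, after the
    // rounding mode has been switched.
    __asm__ __volatile__ ("" : "+m" (a), "+m" (b) : : "memory");

    double result = a + b;            // the emulated FP operation

    // ... and the result is consumed here, before the mode is restored.
    __asm__ __volatile__ ("" : "+m" (result) : : "memory");

    std::fesetround(oldrnd);
    return result;
}

The real ARM code presumably differs in its details, but the operand-tying is the part that actually constrains the scheduling.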

Gabe Black
2011-10-28 06:23:18 UTC
Permalink
I forgot to mention that while working on ARM, I did actually look at
the assembly that was generated and gcc was moving things around in less
than helpful ways. You're welcome to look at the assembly if you don't
believe me :-). SPARC is pretty straightforward ISA-description-wise, so
it shouldn't be too difficult to find the responsible code.

Gabe

Steve Reinhardt
2011-10-28 14:30:34 UTC
Permalink
I believe what you say; it's what you hadn't said that I was wondering about
;-).

Steve

On Thu, Oct 27, 2011 at 11:23 PM, Gabe Black <gblack-***@public.gmane.org> wrote:

> I forgot to mention that while working on ARM, I did actually look at
> the assembly that was generated and gcc was moving things around in less
> than helpful ways. You're welcome to look at the assembly if you don't
> believe me :-). SPARC is pretty straightforward ISA-description-wise, so
> it shouldn't be too difficult to find the responsible code.
>
> Gabe
>
> On 10/27/11 11:30, Gabe Black wrote:
> > Exactly this was happening on ARM, and that's why there are all the
> > weird __asm__ statements in its instructions. There I had to also
> > specify variables as inputs and outputs from the __asm__ statements so
> > that other instructions would have to produce their value before or
> > consume their value after that point. Here I can't do that since (quite
> > reasonably) the code that sets the rounding mode is factored out into a
> > common blob, and I don't know what the necessary variables are. There's
> > supposed to be some way to prevent this sort of problem, but for gcc
> > it's not implemented. I forget exactly how that's supposed to work.
> >
> > Gabe
> >
> > On 10/27/11 08:32, Steve Reinhardt wrote:
> >> Are you positive this is it? It does sound very likely that this is the
> >> issue, but is there indisputable evidence, like you looked at the
> >> disassembly and you can see that things are scheduled in the wrong
> order?
> >> I'm asking because even though I agree that this seems likely to be the
> >> issue, it seems equally unlikely that gcc would reorder operations
> around
> >> function calls like m5_fesetround() (unless they're inlined), and the
> fact
> >> that the asm statements didn't help seems like further evidence that
> maybe
> >> we're not focusing on exactly the right place.
> >>
> >> Steve
> >>
> >> On Thu, Oct 27, 2011 at 12:35 AM, Gabe Black <gblack-***@public.gmane.org>
> wrote:
> >>
> >>> I'm convinced we've successfully identified the problem, but
> >>> unfortunately I added barriers liberally and it still failed.
> >>>
> >>> Gabe
> >>>
> >>> int newrnd = M5_FE_TONEAREST;
> >>> switch (Fsr<31:30>) {
> >>> case 0: newrnd = M5_FE_TONEAREST; break;
> >>> case 1: newrnd = M5_FE_TOWARDZERO; break;
> >>> case 2: newrnd = M5_FE_UPWARD; break;
> >>> case 3: newrnd = M5_FE_DOWNWARD; break;
> >>> }
> >>> __asm__ __volatile__ ("" ::: "memory");
> >>> int oldrnd = m5_fegetround();
> >>> __asm__ __volatile__ ("" ::: "memory");
> >>> m5_fesetround(newrnd);
> >>> __asm__ __volatile__ ("" ::: "memory");
> >>> """
> >>>
> >>> fp_code += code
> >>>
> >>>
> >>> fp_code += """
> >>> __asm__ __volatile__ ("" ::: "memory");
> >>> m5_fesetround(oldrnd);
> >>> __asm__ __volatile__ ("" ::: "memory");
> >>> """
> >>> fp_code = filterDoubles(fp_code)
> >>> iop = InstObjParams(name, Name, 'SparcStaticInst', fp_code,
> flags)
> >>> header_output = BasicDeclare.subst(iop)
> >>> decoder_output = BasicConstructor.subst(iop)
> >>> decode_block = BasicDecode.subst(iop)
> >>> exec_output = BasicExecute.subst(iop)
> >>> }};
> >>>
> >>>
> >>> On 10/26/11 07:10, Steve Reinhardt wrote:
> >>>> I forgot to mention that I fired off a gem5.debug run before I went to
> >>> bed
> >>>> last night, and it completed successfully. So it does appear to be
> the
> >>>> optimizer.
> >>>>
> >>>> Steve
> >>>>
> >>>> On Wed, Oct 26, 2011 at 12:55 AM, Gabe Black <gblack-***@public.gmane.org>
> >>> wrote:
> >>>>> On 10/25/11 22:28, Ali Saidi wrote:
> >>>>>> On Tue, 25 Oct 2011 11:53:29 -0700, Gabe Black <
> gblack-***@public.gmane.org>
> >>>>>> wrote:
> >>>>>>> On 10/25/11 07:46, Steve Reinhardt wrote:
> >>>>>>>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <
> gblack-***@public.gmane.org>
> >>>>>>>> wrote:
> >>>>>>>>> I'm currently building binutils for SPARC, so hopefully I can
> >>>>>>>>> disassemble some things and get a better idea of what's going on.
> >>> It's
> >>>>>>>>> probably going to be really annoying to figure it out.
> >>>>>>>> If it's really just an FP rounding error, it might not be that
> >>>>>>>> hard... just
> >>>>>>>> look at the examples from the trace of where it's going wrong,
> >>>>>>>> figure out
> >>>>>>>> what the right answer is, and focus on those few instructions. FP
> >>>>>>>> is pretty
> >>>>>>>> thoroughly specified by IEEE, so if it's not an outright compiler
> >>>>>>>> bug, maybe
> >>>>>>>> it's just some change in the default rounding settings or
> something.
> >>>>>>> Yeah, I think ISAs treat IEEE as a really good suggestion rather
> than
> >>> a
> >>>>>>> standard. ARM isn't strictly conformant, and neither is x86. The
> >>> default
> >>>>>>> rounding mode *is* standard, though, and I don't think is adjusted
> in
> >>>>>>> SPARC as a result of execution. If it changed somehow (unless I'm
> >>>>>>> forgetting where SPARC does that) it's a fairly significant
> problem.
> >>>>>>> Whether instructions generate +/- 0 in various situations may
> depend
> >>> on,
> >>>>>>> for instance, what order gcc decides to put the operands. I'm not
> sure
> >>>>>>> that it does, but there are all kinds of weird, subtle behaviors
> with
> >>>>>>> FP, and you can't just fix how add works if x86 picked the wrong
> >>> thing.
> >>>>>>> Then you have to replace add, or semi-replace it by faking it out
> with
> >>>>>>> other FP operations. If we're running real x87 instructions (we
> >>>>>>> shouldn't be in 64 bit mode, but we still could) then those use 80
> bit
> >>>>>>> operands internally. Where and when rounding takes place depends on
> >>> when
> >>>>>>> those are moved in/out of the FPU, and will be different than true
> 64
> >>>>>>> bit operands. SSE based FP uses real 64 bit doubles, so that should
> >>>>>>> behave better. It should also be the default in 64 bit mode since
> the
> >>>>>>> compiler can assume some basic SSE support is present.
> >>>>>> The rounding mode in SPARC is controlled by bits 31:30 of the FSR.
> My
> >>>>>> guess is that this is actually the problem and gcc 4.5+ is doing
> some
> >>>>>> code motion that is moving the actual fp code around our setting of
> >>>>>> the rounding mode. Using one of the asm tricks to prevent code
> >>>>>> movement (supposedly an empty asm() is supposed to be code barrier
> in
> >>>>>> gcc), might fix the problem. I don't have time to try it, but
> >>>>>> src/arch/sparc/isa/formats/basic.isa:145 looks like the right place.
> >>>>>> Also, trying to run the regression with m5.debug might see if the
> >>>>>> optimizer is at fault.
> >>>>>>
> >>>>>> Ali
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> gem5-dev mailing list
> >>>>>> gem5-dev-1Gs4CP2/***@public.gmane.org
> >>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
> >>>>> Ah, ok, so we do set the mode apparently. I'll try gem5.debug and
> also
> >>>>> look at that template and see what I can see. Thanks Ali!
> >>>>>
> >>>>> Gabe
> >>>>> _______________________________________________
> >>>>> gem5-dev mailing list
> >>>>> gem5-dev-1Gs4CP2/***@public.gmane.org
> >>>>> http://m5sim.org/mailman/listinfo/gem5-dev
> >>>>>
> >>>> _______________________________________________
> >>>> gem5-dev mailing list
> >>>> gem5-dev-1Gs4CP2/***@public.gmane.org
> >>>> http://m5sim.org/mailman/listinfo/gem5-dev
> >>> _______________________________________________
> >>> gem5-dev mailing list
> >>> gem5-dev-1Gs4CP2/***@public.gmane.org
> >>> http://m5sim.org/mailman/listinfo/gem5-dev
> >>>
> >> _______________________________________________
> >> gem5-dev mailing list
> >> gem5-dev-1Gs4CP2/***@public.gmane.org
> >> http://m5sim.org/mailman/listinfo/gem5-dev
> > _______________________________________________
> > gem5-dev mailing list
> > gem5-dev-1Gs4CP2/***@public.gmane.org
> > http://m5sim.org/mailman/listinfo/gem5-dev
>
> _______________________________________________
> gem5-dev mailing list
> gem5-dev-1Gs4CP2/***@public.gmane.org
> http://m5sim.org/mailman/listinfo/gem5-dev
>
Radivoje Vasiljevic
2011-10-27 14:43:40 UTC
Permalink
----- Original Message -----
From: "Gabe Black" <***@eecs.umich.edu>
To: <gem5-***@gem5.org>
Sent: 25. октобар 2011 20:53
Subject: Re: [gem5-dev] Failed SPARC test


> On 10/25/11 07:46, Steve Reinhardt wrote:
>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <***@eecs.umich.edu>
>> wrote:

[snip]
>
> Yeah, I think ISAs treat IEEE as a really good suggestion rather than a
> standard. ARM isn't strictly conformant, and neither is x86. The default
> rounding mode *is* standard, though, and I don't think is adjusted in
> SPARC as a result of execution. If it changed somehow (unless I'm
> forgetting where SPARC does that) it's a fairly significant problem.
> Whether instructions generate +/- 0 in various situations may depend on,
> for instance, what order gcc decides to put the operands. I'm not sure
> that it does, but there are all kinds of weird, subtle behaviors with
> FP, and you can't just fix how add works if x86 picked the wrong thing.
> Then you have to replace add, or semi-replace it by faking it out with
> other FP operations. If we're running real x87 instructions (we
> shouldn't be in 64 bit mode, but we still could) then those use 80 bit
> operands internally. Where and when rounding takes place depends on when
> those are moved in/out of the FPU, and will be different than true 64
> bit operands. SSE based FP uses real 64 bit doubles, so that should
> behave better. It should also be the default in 64 bit mode since the
> compiler can assume some basic SSE support is present.
>

What about FP emulation using integers and some kind of multiple-precision
arithmetic? Then every detail could be modeled, including x87 "floats" and
"doubles" (in registers the exponent field is still 15 bits, not 8/11, which
makes a mess of overflow/underflow; only when a value goes to memory does it
become a proper float/double). Gcc has some switches regarding that behavior,
but they are very fragile (more a suggestion to the compiler than an enforced
option). Double rounding in x87 is a special story: the double-extended
mantissa is less than twice as long as the double mantissa, so rounding twice
can give a different result than rounding once (this situation can't happen
with float vs. double). One solution, for example: split the mantissas into
two halves and operate on the halves, so all the bits are available and any
kind of rounding can then be enforced (real IEEE or "ISA-style IEEE").
Performing those operations is not very slow, and the code is fairly
ILP-rich, so the slowdown is not as great as a pure instruction count would
suggest (although to get robust code that is independent of the CPU and
compiler, especially of "optimizing" compilers, some tests are needed to
handle subnormals, given their poor hardware support/trap emulation). Plus,
if the instructions are mixed in the right way, both the integer and FP units
can be kept busy. The exponent can be held in one short and that problem is
solved. Only division is somewhat tricky (and slow), but it can be done too.
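
As a concrete illustration of the integer-based approach (a sketch only, not
a gem5 proposal), the starting point is unpacking a double into sign,
exponent, and mantissa with plain integer operations; the arithmetic and the
rounding would then be done on those fields in software. Subnormals, zeros,
infinities, and NaNs are deliberately ignored here:

#include <cstdint>
#include <cstring>

struct SoftDouble {
    uint64_t sign;      // 1 bit
    int64_t  exponent;  // unbiased
    uint64_t mantissa;  // implicit leading 1 made explicit
};

SoftDouble unpack(double d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));  // reinterpret the bits without UB
    SoftDouble s;
    s.sign     = bits >> 63;
    s.exponent = static_cast<int64_t>((bits >> 52) & 0x7ff) - 1023;
    s.mantissa = (bits & 0xfffffffffffffULL) | (1ULL << 52);
    return s;
}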


>> Even if the FP rounding error isn't the source of the problem, it might
>> be
>> easiest to fix that and get it out of the way so we can see what the
>> actual
>> problem is.
>>
>> If you really want to know *why* the kernel is doing all this FP, then
>> yes,
>> you probably need to look at the source code.
>>
>> Steve
>> _______________________________________________
>> gem5-dev mailing list
>> gem5-***@gem5.org
>> http://m5sim.org/mailman/listinfo/gem5-dev
>
> _______________________________________________
> gem5-dev mailing list
> gem5-***@gem5.org
> http://m5sim.org/mailman/listinfo/gem5-dev
>
Gabe Black
2011-10-28 06:36:55 UTC
Permalink
I think there was talk of an FP emulation library a long time ago
(before I was involved with M5) but we decided not to do something like
that for some reason. Using regular built in FP support gets us most of
the way with minimal hassle, but then there are situations like this
where it really causes trouble. I presume the prior discussion might
have been about whether getting most of the way there was good enough,
and that it's simpler.

Gabe

On 10/27/11 07:43, Radivoje Vasiljevic wrote:
>
> ----- Original Message ----- From: "Gabe Black" <***@eecs.umich.edu>
> To: <gem5-***@gem5.org>
> Sent: 25. октобар 2011 20:53
> Subject: Re: [gem5-dev] Failed SPARC test
>
>
>> On 10/25/11 07:46, Steve Reinhardt wrote:
>>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <***@eecs.umich.edu>
>>> wrote:
>
> [snip]
>>
>> Yeah, I think ISAs treat IEEE as a really good suggestion rather than a
>> standard. ARM isn't strictly conformant, and neither is x86. The default
>> rounding mode *is* standard, though, and I don't think is adjusted in
>> SPARC as a result of execution. If it changed somehow (unless I'm
>> forgetting where SPARC does that) it's a fairly significant problem.
>> Whether instructions generate +/- 0 in various situations may depend on,
>> for instance, what order gcc decides to put the operands. I'm not sure
>> that it does, but there are all kinds of weird, subtle behaviors with
>> FP, and you can't just fix how add works if x86 picked the wrong thing.
>> Then you have to replace add, or semi-replace it by faking it out with
>> other FP operations. If we're running real x87 instructions (we
>> shouldn't be in 64 bit mode, but we still could) then those use 80 bit
>> operands internally. Where and when rounding takes place depends on when
>> those are moved in/out of the FPU, and will be different than true 64
>> bit operands. SSE based FP uses real 64 bit doubles, so that should
>> behave better. It should also be the default in 64 bit mode since the
>> compiler can assume some basic SSE support is present.
>>
>
> What about FP emulation using integers and some kind of multiple
> precision
> arithmetic? Then every detail could be modeled, including x87 "floats"
> and
> "doubles" (in registers exponent field is still 15 bits, not 8/11 and
> makes
> mess of overflow/underflow, or it will go in memory and will be proper
> float/double). Gcc has some switches regarding that behavior but that is
> very fragile (more like suggestion to compiler then enforcing option).
> Double rounding in x87 is special story because double extended
> mantissa is not more than twice longer then one for double so double
> rounding can give different results compared to single rounding (this
> situation can't happen
> with float vs double). One solution, for example: splitting mantissas
> into to halves and performing operation, all bits would be available
> and then proper any kind of rounding could be enforced (real ieee or
> "isa style ieee"). Performing those operations is not very slow and it
> is fairly ILP reach so slowdown is not that great as when pure number
> of instructions is compared (although to have robust code, cpu and
> compiler independence, specially about "optimizing code" some tests
> are needed to eradicate subnormals due poor support/trap emulation).
> Plus if instructions are mixed in right way both int and fpu units can
> be kept busy. Exponent can be one short and problem solved. Only
> division can be somewhattricky (and slow), but it can be done too.
>
>
>>> Even if the FP rounding error isn't the source of the problem, it might
>>> be
>>> easiest to fix that and get it out of the way so we can see what the
>>> actual
>>> problem is.
>>>
>>> If you really want to know *why* the kernel is doing all this FP, then
>>> yes,
>>> you probably need to look at the source code.
>>>
>>> Steve
>>> _______________________________________________
>>> gem5-dev mailing list
>>> gem5-***@gem5.org
>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>
>> _______________________________________________
>> gem5-dev mailing list
>> gem5-***@gem5.org
>> http://m5sim.org/mailman/listinfo/gem5-dev
>>
>
>
> _______________________________________________
> gem5-dev mailing list
> gem5-***@gem5.org
> http://m5sim.org/mailman/listinfo/gem5-dev
Steve Reinhardt
2011-10-28 14:38:59 UTC
Permalink
Yes, I think there exists at least one software IEEE FP implementation out
there that we had talked about incorporating at some point (long ago).
Unfortunately, as is discussed below, that's not even the issue, as we
really want to model the not-quite-IEEE (or in the case of x87,
not-even-close) semantics of the hardware alone, which would require more
effort.

If someone really cared about modeling the ISA FP support precisely that
would be an interesting project, and if it was done cleanly (probably with
the option to turn it on or off) we'd be glad to incorporate it.

Ironically I think the issue here is not that the HW FP is not good enough
for our purposes; it's that the software stack doesn't give us clean enough
access to the HW facilities (gcc in particular, though C itself may share
part of the blame).

Steve

On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <***@eecs.umich.edu> wrote:

> I think there was talk of an FP emulation library a long time ago
> (before I was involved with M5) but we decided not to do something like
> that for some reason. Using regular built in FP support gets us most of
> the way with minimal hassle, but then there are situations like this
> where it really causes trouble. I presume the prior discussion might
> have been about whether getting most of the way there was good enough,
> and that it's simpler.
>
> Gabe
>
> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
> >
> > ----- Original Message ----- From: "Gabe Black" <***@eecs.umich.edu>
> > To: <gem5-***@gem5.org>
> > Sent: 25. октобар 2011 20:53
> > Subject: Re: [gem5-dev] Failed SPARC test
> >
> >
> >> On 10/25/11 07:46, Steve Reinhardt wrote:
> >>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <***@eecs.umich.edu>
> >>> wrote:
> >
> > [snip]
> >>
> >> Yeah, I think ISAs treat IEEE as a really good suggestion rather than a
> >> standard. ARM isn't strictly conformant, and neither is x86. The default
> >> rounding mode *is* standard, though, and I don't think is adjusted in
> >> SPARC as a result of execution. If it changed somehow (unless I'm
> >> forgetting where SPARC does that) it's a fairly significant problem.
> >> Whether instructions generate +/- 0 in various situations may depend on,
> >> for instance, what order gcc decides to put the operands. I'm not sure
> >> that it does, but there are all kinds of weird, subtle behaviors with
> >> FP, and you can't just fix how add works if x86 picked the wrong thing.
> >> Then you have to replace add, or semi-replace it by faking it out with
> >> other FP operations. If we're running real x87 instructions (we
> >> shouldn't be in 64 bit mode, but we still could) then those use 80 bit
> >> operands internally. Where and when rounding takes place depends on when
> >> those are moved in/out of the FPU, and will be different than true 64
> >> bit operands. SSE based FP uses real 64 bit doubles, so that should
> >> behave better. It should also be the default in 64 bit mode since the
> >> compiler can assume some basic SSE support is present.
> >>
> >
> > What about FP emulation using integers and some kind of multiple
> > precision
> > arithmetic? Then every detail could be modeled, including x87 "floats"
> > and
> > "doubles" (in registers exponent field is still 15 bits, not 8/11 and
> > makes
> > mess of overflow/underflow, or it will go in memory and will be proper
> > float/double). Gcc has some switches regarding that behavior but that is
> > very fragile (more like suggestion to compiler then enforcing option).
> > Double rounding in x87 is special story because double extended
> > mantissa is not more than twice longer then one for double so double
> > rounding can give different results compared to single rounding (this
> > situation can't happen
> > with float vs double). One solution, for example: splitting mantissas
> > into to halves and performing operation, all bits would be available
> > and then proper any kind of rounding could be enforced (real ieee or
> > "isa style ieee"). Performing those operations is not very slow and it
> > is fairly ILP reach so slowdown is not that great as when pure number
> > of instructions is compared (although to have robust code, cpu and
> > compiler independence, specially about "optimizing code" some tests
> > are needed to eradicate subnormals due poor support/trap emulation).
> > Plus if instructions are mixed in right way both int and fpu units can
> > be kept busy. Exponent can be one short and problem solved. Only
> > division can be somewhattricky (and slow), but it can be done too.
> >
> >
> >>> Even if the FP rounding error isn't the source of the problem, it might
> >>> be
> >>> easiest to fix that and get it out of the way so we can see what the
> >>> actual
> >>> problem is.
> >>>
> >>> If you really want to know *why* the kernel is doing all this FP, then
> >>> yes,
> >>> you probably need to look at the source code.
> >>>
> >>> Steve
> >>> _______________________________________________
> >>> gem5-dev mailing list
> >>> gem5-***@gem5.org
> >>> http://m5sim.org/mailman/listinfo/gem5-dev
> >>
> >> _______________________________________________
> >> gem5-dev mailing list
> >> gem5-***@gem5.org
> >> http://m5sim.org/mailman/listinfo/gem5-dev
> >>
> >
> >
> > _______________________________________________
> > gem5-dev mailing list
> > gem5-***@gem5.org
> > http://m5sim.org/mailman/listinfo/gem5-dev
>
> _______________________________________________
> gem5-dev mailing list
> gem5-***@gem5.org
> http://m5sim.org/mailman/listinfo/gem5-dev
>
Ali Saidi
2011-10-28 15:31:46 UTC
Permalink
I'm still not 100% convinced that this is it. I agree it's highly
likely, but it could be some other code movement or a bug in the
optimizer (we have seen them before). I wonder if you can selectively
optimize functions. Maybe a good start would be compiling everything at -O3
except the atomic execute function and making sure it still works.
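
For reference, gcc 4.4 and later can also do this per function with the
optimize attribute, so a single suspect function can be dropped to -O0 while
the rest of the build stays at -O3. A minimal sketch; the function below is
a stand-in, not gem5's actual execute method:

// Only this function is forced down to -O0; the rest of the translation
// unit keeps whatever -O level the build uses.
__attribute__((optimize("O0")))
double fp_add_unoptimized(double a, double b)
{
    return a + b;
}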

Ali



On Fri, 28 Oct 2011 07:38:59 -0700, Steve Reinhardt <***@gmail.com>
wrote:
> Yes, I think there exists at least one software IEEE FP
> implementation out
> there that we had talked about incorporating at some point (long
> ago).
> Unfortunately, as is discussed below, that's not even the issue, as
> we
> really want to model the not-quite-IEEE (or in the case of x87,
> not-even-close) semantics of the hardware alone, which would require
> more
> effort.
>
> If someone really cared about modeling the ISA FP support precisely
> that
> would be an interesting project, and if it was done cleanly (probably
> with
> the option to turn it on or off) we'd be glad to incorporate it.
>
> Ironically I think the issue here is not that the HW FP is not good
> enough
> for our purposes, it's that the software stack doesn't give us clean
> enough
> access to the HW facilities (gcc in particular, though C itself may
> share
> part of the blame).
>
> Steve
>
> On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <***@eecs.umich.edu>
> wrote:
>
>> I think there was talk of an FP emulation library a long time ago
>> (before I was involved with M5) but we decided not to do something
>> like
>> that for some reason. Using regular built in FP support gets us most
>> of
>> the way with minimal hassle, but then there are situations like this
>> where it really causes trouble. I presume the prior discussion might
>> have been about whether getting most of the way there was good
>> enough,
>> and that it's simpler.
>>
>> Gabe
>>
>> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
>> >
>> > ----- Original Message ----- From: "Gabe Black"
>> <***@eecs.umich.edu>
>> > To: <gem5-***@gem5.org>
>> > Sent: 25. октобар 2011 20:53
>> > Subject: Re: [gem5-dev] Failed SPARC test
>> >
>> >
>> >> On 10/25/11 07:46, Steve Reinhardt wrote:
>> >>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black
>> <***@eecs.umich.edu>
>> >>> wrote:
>> >
>> > [snip]
>> >>
>> >> Yeah, I think ISAs treat IEEE as a really good suggestion rather
>> than a
>> >> standard. ARM isn't strictly conformant, and neither is x86. The
>> default
>> >> rounding mode *is* standard, though, and I don't think is
>> adjusted in
>> >> SPARC as a result of execution. If it changed somehow (unless I'm
>> >> forgetting where SPARC does that) it's a fairly significant
>> problem.
>> >> Whether instructions generate +/- 0 in various situations may
>> depend on,
>> >> for instance, what order gcc decides to put the operands. I'm not
>> sure
>> >> that it does, but there are all kinds of weird, subtle behaviors
>> with
>> >> FP, and you can't just fix how add works if x86 picked the wrong
>> thing.
>> >> Then you have to replace add, or semi-replace it by faking it out
>> with
>> >> other FP operations. If we're running real x87 instructions (we
>> >> shouldn't be in 64 bit mode, but we still could) then those use
>> 80 bit
>> >> operands internally. Where and when rounding takes place depends
>> on when
>> >> those are moved in/out of the FPU, and will be different than
>> true 64
>> >> bit operands. SSE based FP uses real 64 bit doubles, so that
>> should
>> >> behave better. It should also be the default in 64 bit mode since
>> the
>> >> compiler can assume some basic SSE support is present.
>> >>
>> >
>> > What about FP emulation using integers and some kind of multiple
>> > precision
>> > arithmetic? Then every detail could be modeled, including x87
>> "floats"
>> > and
>> > "doubles" (in registers exponent field is still 15 bits, not 8/11
>> and
>> > makes
>> > mess of overflow/underflow, or it will go in memory and will be
>> proper
>> > float/double). Gcc has some switches regarding that behavior but
>> that is
>> > very fragile (more like suggestion to compiler then enforcing
>> option).
>> > Double rounding in x87 is special story because double extended
>> > mantissa is not more than twice longer then one for double so
>> double
>> > rounding can give different results compared to single rounding
>> (this
>> > situation can't happen
>> > with float vs double). One solution, for example: splitting
>> mantissas
>> > into to halves and performing operation, all bits would be
>> available
>> > and then proper any kind of rounding could be enforced (real ieee
>> or
>> > "isa style ieee"). Performing those operations is not very slow
>> and it
>> > is fairly ILP reach so slowdown is not that great as when pure
>> number
>> > of instructions is compared (although to have robust code, cpu and
>> > compiler independence, specially about "optimizing code" some
>> tests
>> > are needed to eradicate subnormals due poor support/trap
>> emulation).
>> > Plus if instructions are mixed in right way both int and fpu units
>> can
>> > be kept busy. Exponent can be one short and problem solved. Only
>> > division can be somewhattricky (and slow), but it can be done too.
>> >
>> >
>> >>> Even if the FP rounding error isn't the source of the problem,
>> it might
>> >>> be
>> >>> easiest to fix that and get it out of the way so we can see what
>> the
>> >>> actual
>> >>> problem is.
>> >>>
>> >>> If you really want to know *why* the kernel is doing all this
>> FP, then
>> >>> yes,
>> >>> you probably need to look at the source code.
>> >>>
>> >>> Steve
>> >>> _______________________________________________
>> >>> gem5-dev mailing list
>> >>> gem5-***@gem5.org
>> >>> http://m5sim.org/mailman/listinfo/gem5-dev
>> >>
>> >> _______________________________________________
>> >> gem5-dev mailing list
>> >> gem5-***@gem5.org
>> >> http://m5sim.org/mailman/listinfo/gem5-dev
>> >>
>> >
>> >
>> > _______________________________________________
>> > gem5-dev mailing list
>> > gem5-***@gem5.org
>> > http://m5sim.org/mailman/listinfo/gem5-dev
>>
>> _______________________________________________
>> gem5-dev mailing list
>> gem5-***@gem5.org
>> http://m5sim.org/mailman/listinfo/gem5-dev
>>
> _______________________________________________
> gem5-dev mailing list
> gem5-***@gem5.org
> http://m5sim.org/mailman/listinfo/gem5-dev
Gabe Black
2011-10-29 20:31:37 UTC
Permalink
Here is some suspect assembly from Fadds for the atomic simple CPU

0x00000000008d538e <+382>: callq 0x4cab70 <m5_fegetround>
0x00000000008d5393 <+387>: mov %eax,%r15d
0x00000000008d5396 <+390>: mov %r14d,%edi
0x00000000008d5399 <+393>: callq 0x4cab30 <m5_fesetround>
0x00000000008d539e <+398>: mov %r15d,%edi
0x00000000008d53a1 <+401>: callq 0x4cab30 <m5_fesetround>


This is, more or less, from the following code.


__asm__ __volatile__ ("" ::: "memory");
int oldrnd = m5_fegetround();
__asm__ __volatile__ ("" ::: "memory");
m5_fesetround(newrnd);
__asm__ __volatile__ ("" ::: "memory");
Frds = Frs1s + Frs2s;
__asm__ __volatile__ ("" ::: "memory");
m5_fesetround(oldrnd);
__asm__ __volatile__ ("" ::: "memory");


Note that the addition was moved out of the middle and fesetround was
called twice back to back, once to set the new rounding mode, and once
to set it right back again.
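
For anyone reproducing this: a listing like the one above can be pulled out
of the optimized binary with gdb in batch mode. The symbol name and binary
path below are placeholders, not the real mangled name or build path:

gdb --batch -ex 'disassemble FaddsExecuteSymbol' build/SPARC_FS/gem5.opt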

Gabe

On 10/28/11 08:31, Ali Saidi wrote:
> I'm still not 100% convinced that this is it. I agree it's highly
> likely, but it could be some other code movement or a bug in the
> optimizer (we have seen them before). I wonder if you can selectively
> optimize functions. Maybe a good start is compiling everything -O3
> except the atomic execute function and make sure it still works.
>
> Ali
>
>
>
> On Fri, 28 Oct 2011 07:38:59 -0700, Steve Reinhardt <***@gmail.com>
> wrote:
>> Yes, I think there exists at least one software IEEE FP
>> implementation out
>> there that we had talked about incorporating at some point (long ago).
>> Unfortunately, as is discussed below, that's not even the issue, as we
>> really want to model the not-quite-IEEE (or in the case of x87,
>> not-even-close) semantics of the hardware alone, which would require
>> more
>> effort.
>>
>> If someone really cared about modeling the ISA FP support precisely that
>> would be an interesting project, and if it was done cleanly (probably
>> with
>> the option to turn it on or off) we'd be glad to incorporate it.
>>
>> Ironically I think the issue here is not that the HW FP is not good
>> enough
>> for our purposes, it's that the software stack doesn't give us clean
>> enough
>> access to the HW facilities (gcc in particular, though C itself may
>> share
>> part of the blame).
>>
>> Steve
>>
>> On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <***@eecs.umich.edu>
>> wrote:
>>
>>> I think there was talk of an FP emulation library a long time ago
>>> (before I was involved with M5) but we decided not to do something like
>>> that for some reason. Using regular built in FP support gets us most of
>>> the way with minimal hassle, but then there are situations like this
>>> where it really causes trouble. I presume the prior discussion might
>>> have been about whether getting most of the way there was good enough,
>>> and that it's simpler.
>>>
>>> Gabe
>>>
>>> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
>>> >
>>> > ----- Original Message ----- From: "Gabe Black"
>>> <***@eecs.umich.edu>
>>> > To: <gem5-***@gem5.org>
>>> > Sent: 25. октобар 2011 20:53
>>> > Subject: Re: [gem5-dev] Failed SPARC test
>>> >
>>> >
>>> >> On 10/25/11 07:46, Steve Reinhardt wrote:
>>> >>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <***@eecs.umich.edu>
>>> >>> wrote:
>>> >
>>> > [snip]
>>> >>
>>> >> Yeah, I think ISAs treat IEEE as a really good suggestion rather
>>> than a
>>> >> standard. ARM isn't strictly conformant, and neither is x86. The
>>> default
>>> >> rounding mode *is* standard, though, and I don't think is
>>> adjusted in
>>> >> SPARC as a result of execution. If it changed somehow (unless I'm
>>> >> forgetting where SPARC does that) it's a fairly significant problem.
>>> >> Whether instructions generate +/- 0 in various situations may
>>> depend on,
>>> >> for instance, what order gcc decides to put the operands. I'm not
>>> sure
>>> >> that it does, but there are all kinds of weird, subtle behaviors
>>> with
>>> >> FP, and you can't just fix how add works if x86 picked the wrong
>>> thing.
>>> >> Then you have to replace add, or semi-replace it by faking it out
>>> with
>>> >> other FP operations. If we're running real x87 instructions (we
>>> >> shouldn't be in 64 bit mode, but we still could) then those use
>>> 80 bit
>>> >> operands internally. Where and when rounding takes place depends
>>> on when
>>> >> those are moved in/out of the FPU, and will be different than
>>> true 64
>>> >> bit operands. SSE based FP uses real 64 bit doubles, so that should
>>> >> behave better. It should also be the default in 64 bit mode since
>>> the
>>> >> compiler can assume some basic SSE support is present.
>>> >>
>>> >
>>> > What about FP emulation using integers and some kind of multiple
>>> > precision
>>> > arithmetic? Then every detail could be modeled, including x87
>>> "floats"
>>> > and
>>> > "doubles" (in registers exponent field is still 15 bits, not 8/11 and
>>> > makes
>>> > mess of overflow/underflow, or it will go in memory and will be
>>> proper
>>> > float/double). Gcc has some switches regarding that behavior but
>>> that is
>>> > very fragile (more like suggestion to compiler then enforcing
>>> option).
>>> > Double rounding in x87 is special story because double extended
>>> > mantissa is not more than twice longer then one for double so double
>>> > rounding can give different results compared to single rounding (this
>>> > situation can't happen
>>> > with float vs double). One solution, for example: splitting mantissas
>>> > into to halves and performing operation, all bits would be available
>>> > and then proper any kind of rounding could be enforced (real ieee or
>>> > "isa style ieee"). Performing those operations is not very slow
>>> and it
>>> > is fairly ILP reach so slowdown is not that great as when pure number
>>> > of instructions is compared (although to have robust code, cpu and
>>> > compiler independence, specially about "optimizing code" some tests
>>> > are needed to eradicate subnormals due poor support/trap emulation).
>>> > Plus if instructions are mixed in right way both int and fpu units
>>> can
>>> > be kept busy. Exponent can be one short and problem solved. Only
>>> > division can be somewhattricky (and slow), but it can be done too.
>>> >
>>> >
>>> >>> Even if the FP rounding error isn't the source of the problem,
>>> it might
>>> >>> be
>>> >>> easiest to fix that and get it out of the way so we can see what
>>> the
>>> >>> actual
>>> >>> problem is.
>>> >>>
>>> >>> If you really want to know *why* the kernel is doing all this
>>> FP, then
>>> >>> yes,
>>> >>> you probably need to look at the source code.
>>> >>>
>>> >>> Steve
>>> >>> _______________________________________________
>>> >>> gem5-dev mailing list
>>> >>> gem5-***@gem5.org
>>> >>> http://m5sim.org/mailman/listinfo/gem5-dev
>>> >>
>>> >> _______________________________________________
>>> >> gem5-dev mailing list
>>> >> gem5-***@gem5.org
>>> >> http://m5sim.org/mailman/listinfo/gem5-dev
>>> >>
>>> >
>>> >
>>> > _______________________________________________
>>> > gem5-dev mailing list
>>> > gem5-***@gem5.org
>>> > http://m5sim.org/mailman/listinfo/gem5-dev
>>>
>>> _______________________________________________
>>> gem5-dev mailing list
>>> gem5-***@gem5.org
>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>
>> _______________________________________________
>> gem5-dev mailing list
>> gem5-***@gem5.org
>> http://m5sim.org/mailman/listinfo/gem5-dev
>
> _______________________________________________
> gem5-dev mailing list
> gem5-***@gem5.org
> http://m5sim.org/mailman/listinfo/gem5-dev
Gabe Black
2011-10-29 20:44:59 UTC
Permalink
Here's a discussion on the gcc mailing list of the thing I was talking
about before that's supposed to fix this, I think.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678

Our barriers aren't working since Frs1s, Frs2s, and Frds could all be
registers.

Gabe

On 10/29/11 13:31, Gabe Black wrote:
> Here is some suspect assembly from Fadds for the atomic simple CPU
>
> 0x00000000008d538e <+382>: callq 0x4cab70 <m5_fegetround>
> 0x00000000008d5393 <+387>: mov %eax,%r15d
> 0x00000000008d5396 <+390>: mov %r14d,%edi
> 0x00000000008d5399 <+393>: callq 0x4cab30 <m5_fesetround>
> 0x00000000008d539e <+398>: mov %r15d,%edi
> 0x00000000008d53a1 <+401>: callq 0x4cab30 <m5_fesetround>
>
>
> This is, more or less, from the following code.
>
>
> __asm__ __volatile__ ("" ::: "memory");
> int oldrnd = m5_fegetround();
> __asm__ __volatile__ ("" ::: "memory");
> m5_fesetround(newrnd);
> __asm__ __volatile__ ("" ::: "memory");
> Frds = Frs1s + Frs2s;
> __asm__ __volatile__ ("" ::: "memory");
> m5_fesetround(oldrnd);
> __asm__ __volatile__ ("" ::: "memory");
>
>
> Note that the addition was moved out of the middle and fesetround was
> called twice back to back, once to set the new rounding mode, and once
> to set it right back again.
>
> Gabe
>
> On 10/28/11 08:31, Ali Saidi wrote:
>> I'm still not 100% convinced that this is it. I agree it's highly
>> likely, but it could be some other code movement or a bug in the
>> optimizer (we have seen them before). I wonder if you can selectively
>> optimize functions. Maybe a good start is compiling everything -O3
>> except the atomic execute function and make sure it still works.
>>
>> Ali
>>
>>
>>
>> On Fri, 28 Oct 2011 07:38:59 -0700, Steve Reinhardt <***@gmail.com>
>> wrote:
>>> Yes, I think there exists at least one software IEEE FP
>>> implementation out
>>> there that we had talked about incorporating at some point (long ago).
>>> Unfortunately, as is discussed below, that's not even the issue, as we
>>> really want to model the not-quite-IEEE (or in the case of x87,
>>> not-even-close) semantics of the hardware alone, which would require
>>> more
>>> effort.
>>>
>>> If someone really cared about modeling the ISA FP support precisely that
>>> would be an interesting project, and if it was done cleanly (probably
>>> with
>>> the option to turn it on or off) we'd be glad to incorporate it.
>>>
>>> Ironically I think the issue here is not that the HW FP is not good
>>> enough
>>> for our purposes, it's that the software stack doesn't give us clean
>>> enough
>>> access to the HW facilities (gcc in particular, though C itself may
>>> share
>>> part of the blame).
>>>
>>> Steve
>>>
>>> On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <***@eecs.umich.edu>
>>> wrote:
>>>
>>>> I think there was talk of an FP emulation library a long time ago
>>>> (before I was involved with M5) but we decided not to do something like
>>>> that for some reason. Using regular built in FP support gets us most of
>>>> the way with minimal hassle, but then there are situations like this
>>>> where it really causes trouble. I presume the prior discussion might
>>>> have been about whether getting most of the way there was good enough,
>>>> and that it's simpler.
>>>>
>>>> Gabe
>>>>
>>>> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
>>>>> ----- Original Message ----- From: "Gabe Black"
>>>> <***@eecs.umich.edu>
>>>>> To: <gem5-***@gem5.org>
>>>>> Sent: 25. октобар 2011 20:53
>>>>> Subject: Re: [gem5-dev] Failed SPARC test
>>>>>
>>>>>
>>>>>> On 10/25/11 07:46, Steve Reinhardt wrote:
>>>>>>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <***@eecs.umich.edu>
>>>>>>> wrote:
>>>>> [snip]
>>>>>> Yeah, I think ISAs treat IEEE as a really good suggestion rather
>>>> than a
>>>>>> standard. ARM isn't strictly conformant, and neither is x86. The
>>>> default
>>>>>> rounding mode *is* standard, though, and I don't think is
>>>> adjusted in
>>>>>> SPARC as a result of execution. If it changed somehow (unless I'm
>>>>>> forgetting where SPARC does that) it's a fairly significant problem.
>>>>>> Whether instructions generate +/- 0 in various situations may
>>>> depend on,
>>>>>> for instance, what order gcc decides to put the operands. I'm not
>>>> sure
>>>>>> that it does, but there are all kinds of weird, subtle behaviors
>>>> with
>>>>>> FP, and you can't just fix how add works if x86 picked the wrong
>>>> thing.
>>>>>> Then you have to replace add, or semi-replace it by faking it out
>>>> with
>>>>>> other FP operations. If we're running real x87 instructions (we
>>>>>> shouldn't be in 64 bit mode, but we still could) then those use
>>>> 80 bit
>>>>>> operands internally. Where and when rounding takes place depends
>>>> on when
>>>>>> those are moved in/out of the FPU, and will be different than
>>>> true 64
>>>>>> bit operands. SSE based FP uses real 64 bit doubles, so that should
>>>>>> behave better. It should also be the default in 64 bit mode since
>>>> the
>>>>>> compiler can assume some basic SSE support is present.
>>>>>>
>>>>> What about FP emulation using integers and some kind of multiple
>>>>> precision
>>>>> arithmetic? Then every detail could be modeled, including x87
>>>> "floats"
>>>>> and
>>>>> "doubles" (in registers exponent field is still 15 bits, not 8/11 and
>>>>> makes
>>>>> mess of overflow/underflow, or it will go in memory and will be
>>>> proper
>>>>> float/double). Gcc has some switches regarding that behavior but
>>>> that is
>>>>> very fragile (more like suggestion to compiler then enforcing
>>>> option).
>>>>> Double rounding in x87 is special story because double extended
>>>>> mantissa is not more than twice longer then one for double so double
>>>>> rounding can give different results compared to single rounding (this
>>>>> situation can't happen
>>>>> with float vs double). One solution, for example: splitting mantissas
>>>>> into to halves and performing operation, all bits would be available
>>>>> and then proper any kind of rounding could be enforced (real ieee or
>>>>> "isa style ieee"). Performing those operations is not very slow
>>>> and it
>>>>> is fairly ILP reach so slowdown is not that great as when pure number
>>>>> of instructions is compared (although to have robust code, cpu and
>>>>> compiler independence, specially about "optimizing code" some tests
>>>>> are needed to eradicate subnormals due poor support/trap emulation).
>>>>> Plus if instructions are mixed in right way both int and fpu units
>>>> can
>>>>> be kept busy. Exponent can be one short and problem solved. Only
>>>>> division can be somewhattricky (and slow), but it can be done too.
>>>>>
>>>>>
>>>>>>> Even if the FP rounding error isn't the source of the problem,
>>>> it might
>>>>>>> be
>>>>>>> easiest to fix that and get it out of the way so we can see what
>>>> the
>>>>>>> actual
>>>>>>> problem is.
>>>>>>>
>>>>>>> If you really want to know *why* the kernel is doing all this
>>>> FP, then
>>>>>>> yes,
>>>>>>> you probably need to look at the source code.
>>>>>>>
>>>>>>> Steve
>>>>>>> _______________________________________________
>>>>>>> gem5-dev mailing list
>>>>>>> gem5-***@gem5.org
>>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>>> _______________________________________________
>>>>>> gem5-dev mailing list
>>>>>> gem5-***@gem5.org
>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> gem5-dev mailing list
>>>>> gem5-***@gem5.org
>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>> _______________________________________________
>>>> gem5-dev mailing list
>>>> gem5-***@gem5.org
>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>
>>> _______________________________________________
>>> gem5-dev mailing list
>>> gem5-***@gem5.org
>>> http://m5sim.org/mailman/listinfo/gem5-dev
>> _______________________________________________
>> gem5-dev mailing list
>> gem5-***@gem5.org
>> http://m5sim.org/mailman/listinfo/gem5-dev
> _______________________________________________
> gem5-dev mailing list
> gem5-***@gem5.org
> http://m5sim.org/mailman/listinfo/gem5-dev
Ali Saidi
2011-10-29 20:53:07 UTC
Permalink
I was just about to send a message about -frounding-math when I saw yours. Interesting that the asm barriers appear to work with ARM. It feels like there should be an explicit code motion barrier. Anyway, have we tried compiling with the -frounding-math flag?
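
For context, -frounding-math is meant to disable optimizations that assume
the default rounding mode (constant folding of FP expressions, for example),
though the gcc bug linked earlier in the thread suggests support for dynamic
rounding modes is incomplete. A hedged example of trying it by hand; the
source file name is a placeholder, not a real gem5 path, and wiring the flag
into the scons build is a separate question:

g++ -O3 -frounding-math -c sparc_fp_insts.cc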



Ali

Sent from my ARM powered device

On Oct 29, 2011, at 3:44 PM, Gabe Black <***@eecs.umich.edu> wrote:

> Here's a discussion on the gcc mailing list of the thing I was talking
> about before that's supposed to fix this, I think.
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678
>
> Our barriers aren't working since Frs1s, Frs2s, and Frds could all be
> registers.
>
> Gabe
>
> On 10/29/11 13:31, Gabe Black wrote:
>> Here is some suspect assembly from Fadds for the atomic simple CPU
>>
>> 0x00000000008d538e <+382>: callq 0x4cab70 <m5_fegetround>
>> 0x00000000008d5393 <+387>: mov %eax,%r15d
>> 0x00000000008d5396 <+390>: mov %r14d,%edi
>> 0x00000000008d5399 <+393>: callq 0x4cab30 <m5_fesetround>
>> 0x00000000008d539e <+398>: mov %r15d,%edi
>> 0x00000000008d53a1 <+401>: callq 0x4cab30 <m5_fesetround>
>>
>>
>> This is, more or less, from the following code.
>>
>>
>> __asm__ __volatile__ ("" ::: "memory");
>> int oldrnd = m5_fegetround();
>> __asm__ __volatile__ ("" ::: "memory");
>> m5_fesetround(newrnd);
>> __asm__ __volatile__ ("" ::: "memory");
>> Frds = Frs1s + Frs2s;
>> __asm__ __volatile__ ("" ::: "memory");
>> m5_fesetround(oldrnd);
>> __asm__ __volatile__ ("" ::: "memory");
>>
>>
>> Note that the addition was moved out of the middle and fesetround was
>> called twice back to back, once to set the new rounding mode, and once
>> to set it right back again.
>>
>> Gabe
>>
>> On 10/28/11 08:31, Ali Saidi wrote:
>>> I'm still not 100% convinced that this is it. I agree it's highly
>>> likely, but it could be some other code movement or a bug in the
>>> optimizer (we have seen them before). I wonder if you can selectively
>>> optimize functions. Maybe a good start is compiling everything -O3
>>> except the atomic execute function and make sure it still works.
>>>
>>> Ali
>>>
>>>
>>>
>>> On Fri, 28 Oct 2011 07:38:59 -0700, Steve Reinhardt <***@gmail.com>
>>> wrote:
>>>> Yes, I think there exists at least one software IEEE FP
>>>> implementation out
>>>> there that we had talked about incorporating at some point (long ago).
>>>> Unfortunately, as is discussed below, that's not even the issue, as we
>>>> really want to model the not-quite-IEEE (or in the case of x87,
>>>> not-even-close) semantics of the hardware alone, which would require
>>>> more
>>>> effort.
>>>>
>>>> If someone really cared about modeling the ISA FP support precisely that
>>>> would be an interesting project, and if it was done cleanly (probably
>>>> with
>>>> the option to turn it on or off) we'd be glad to incorporate it.
>>>>
>>>> Ironically I think the issue here is not that the HW FP is not good
>>>> enough
>>>> for our purposes, it's that the software stack doesn't give us clean
>>>> enough
>>>> access to the HW facilities (gcc in particular, though C itself may
>>>> share
>>>> part of the blame).
>>>>
>>>> Steve
>>>>
>>>> On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <***@eecs.umich.edu>
>>>> wrote:
>>>>
>>>>> I think there was talk of an FP emulation library a long time ago
>>>>> (before I was involved with M5) but we decided not to do something like
>>>>> that for some reason. Using regular built in FP support gets us most of
>>>>> the way with minimal hassle, but then there are situations like this
>>>>> where it really causes trouble. I presume the prior discussion might
>>>>> have been about whether getting most of the way there was good enough,
>>>>> and that it's simpler.
>>>>>
>>>>> Gabe
>>>>>
>>>>> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
>>>>>> ----- Original Message ----- From: "Gabe Black"
>>>>> <***@eecs.umich.edu>
>>>>>> To: <gem5-***@gem5.org>
>>>>>> Sent: 25. октобар 2011 20:53
>>>>>> Subject: Re: [gem5-dev] Failed SPARC test
>>>>>>
>>>>>>
>>>>>>> On 10/25/11 07:46, Steve Reinhardt wrote:
>>>>>>>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <***@eecs.umich.edu>
>>>>>>>> wrote:
>>>>>> [snip]
>>>>>>> Yeah, I think ISAs treat IEEE as a really good suggestion rather
>>>>> than a
>>>>>>> standard. ARM isn't strictly conformant, and neither is x86. The
>>>>> default
>>>>>>> rounding mode *is* standard, though, and I don't think is
>>>>> adjusted in
>>>>>>> SPARC as a result of execution. If it changed somehow (unless I'm
>>>>>>> forgetting where SPARC does that) it's a fairly significant problem.
>>>>>>> Whether instructions generate +/- 0 in various situations may
>>>>> depend on,
>>>>>>> for instance, what order gcc decides to put the operands. I'm not
>>>>> sure
>>>>>>> that it does, but there are all kinds of weird, subtle behaviors
>>>>> with
>>>>>>> FP, and you can't just fix how add works if x86 picked the wrong
>>>>> thing.
>>>>>>> Then you have to replace add, or semi-replace it by faking it out
>>>>> with
>>>>>>> other FP operations. If we're running real x87 instructions (we
>>>>>>> shouldn't be in 64 bit mode, but we still could) then those use
>>>>> 80 bit
>>>>>>> operands internally. Where and when rounding takes place depends
>>>>> on when
>>>>>>> those are moved in/out of the FPU, and will be different than
>>>>> true 64
>>>>>>> bit operands. SSE based FP uses real 64 bit doubles, so that should
>>>>>>> behave better. It should also be the default in 64 bit mode since
>>>>> the
>>>>>>> compiler can assume some basic SSE support is present.
>>>>>>>
>>>>>> What about FP emulation using integers and some kind of multiple
>>>>>> precision
>>>>>> arithmetic? Then every detail could be modeled, including x87
>>>>> "floats"
>>>>>> and
>>>>>> "doubles" (in registers exponent field is still 15 bits, not 8/11 and
>>>>>> makes
>>>>>> mess of overflow/underflow, or it will go in memory and will be
>>>>> proper
>>>>>> float/double). Gcc has some switches regarding that behavior but
>>>>> that is
>>>>>> very fragile (more like suggestion to compiler then enforcing
>>>>> option).
>>>>>> Double rounding in x87 is special story because double extended
>>>>>> mantissa is not more than twice longer then one for double so double
>>>>>> rounding can give different results compared to single rounding (this
>>>>>> situation can't happen
>>>>>> with float vs double). One solution, for example: splitting mantissas
>>>>>> into to halves and performing operation, all bits would be available
>>>>>> and then proper any kind of rounding could be enforced (real ieee or
>>>>>> "isa style ieee"). Performing those operations is not very slow
>>>>> and it
>>>>>> is fairly ILP reach so slowdown is not that great as when pure number
>>>>>> of instructions is compared (although to have robust code, cpu and
>>>>>> compiler independence, specially about "optimizing code" some tests
>>>>>> are needed to eradicate subnormals due poor support/trap emulation).
>>>>>> Plus if instructions are mixed in right way both int and fpu units
>>>>> can
>>>>>> be kept busy. Exponent can be one short and problem solved. Only
>>>>>> division can be somewhattricky (and slow), but it can be done too.
>>>>>>
>>>>>>
>>>>>>>> Even if the FP rounding error isn't the source of the problem,
>>>>> it might
>>>>>>>> be
>>>>>>>> easiest to fix that and get it out of the way so we can see what
>>>>> the
>>>>>>>> actual
>>>>>>>> problem is.
>>>>>>>>
>>>>>>>> If you really want to know *why* the kernel is doing all this
>>>>> FP, then
>>>>>>>> yes,
>>>>>>>> you probably need to look at the source code.
>>>>>>>>
>>>>>>>> Steve
>>>>>>>> _______________________________________________
>>>>>>>> gem5-dev mailing list
>>>>>>>> gem5-***@gem5.org
>>>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>>>> _______________________________________________
>>>>>>> gem5-dev mailing list
>>>>>>> gem5-***@gem5.org
>>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> gem5-dev mailing list
>>>>>> gem5-***@gem5.org
>>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>> _______________________________________________
>>>>> gem5-dev mailing list
>>>>> gem5-***@gem5.org
>>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>>>>
>>>> _______________________________________________
>>>> gem5-dev mailing list
>>>> gem5-***@gem5.org
>>>> http://m5sim.org/mailman/listinfo/gem5-dev
>>> _______________________________________________
>>> gem5-dev mailing list
>>> gem5-***@gem5.org
>>> http://m5sim.org/mailman/listinfo/gem5-dev
>> _______________________________________________
>> gem5-dev mailing list
>> gem5-***@gem5.org
>> http://m5sim.org/mailman/listinfo/gem5-dev
>
> _______________________________________________
> gem5-dev mailing list
> gem5-***@gem5.org
> http://m5sim.org/mailman/listinfo/gem5-dev
Gabe Black
2011-10-29 21:21:49 UTC
Permalink
Yes, it doesn't work either. What makes the ARM asm statements work is
that they have input and output arguments. That ties them into the data
flow graph having to do with those values, and they act as anchors,
forcing values to be produced by the time you get to the asm and not to
be consumed before it. Here we're just saying not to trust memory from
before the asm, and since it's not *in* memory, the compiler merrily
ignores us. I had this problem with ARM initially too until I added the
arguments. I've tried making floating point variables volatile to ensure
they're in memory, and that doesn't work either. I think the actual
semantics of volatile are a little different than what most people
assume, although I don't remember what the distinction is. One option
might be to make the FP operation itself a virtual function. Then gcc
won't know what it does and will be less able to break things by moving
things around.
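
A minimal sketch of that operand-anchored barrier, assuming an x86-64 host
where single-precision values live in SSE registers (hence the "x"
constraint), and using the standard <cfenv> functions in place of the m5_*
wrappers:

#include <cfenv>

float add_with_rounding(float a, float b, int newrnd)
{
    int oldrnd = std::fegetround();
    std::fesetround(newrnd);
    // a and b are outputs (and inputs) of the empty asm, so they must be
    // live here and the add below cannot be hoisted above this point.
    __asm__ __volatile__ ("" : "+x" (a), "+x" (b) : : "memory");
    float r = a + b;
    // Same trick on the result: it has to be produced before the rounding
    // mode is restored below.
    __asm__ __volatile__ ("" : "+x" (r) : : "memory");
    std::fesetround(oldrnd);
    return r;
}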

It seems like a pretty severe deficiency of gcc that there's no way to
make fesetround work properly. It becomes nearly worthless because you
can't make any assumptions about when it will actually be in effect.
That's what we have to work with, though.

Gabe

On 10/29/11 13:53, Ali Saidi wrote:
> I was just about to send a message about -frounding-math when I saw yours. Interesting that the asm barriers appears to work with ARM. It feels like there should be an explicit code motion barrier. Anyway, have we tried compiling with the -frounding-math flag?
>
>
>
> Ali
>
> Sent from my ARM powered device
>
> On Oct 29, 2011, at 3:44 PM, Gabe Black <***@eecs.umich.edu> wrote:
>
>> Here's a discussion on the gcc mailing list of the thing I was talking
>> about before that's supposed to fix this, I think.
>>
>> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678
>>
>> Our barriers aren't working since Frs1s, Frs2s, and Frds could all be
>> registers.
>>
>> Gabe
>>
>> On 10/29/11 13:31, Gabe Black wrote:
>>> Here is some suspect assembly from Fadds for the atomic simple CPU
>>>
>>> 0x00000000008d538e <+382>: callq 0x4cab70 <m5_fegetround>
>>> 0x00000000008d5393 <+387>: mov %eax,%r15d
>>> 0x00000000008d5396 <+390>: mov %r14d,%edi
>>> 0x00000000008d5399 <+393>: callq 0x4cab30 <m5_fesetround>
>>> 0x00000000008d539e <+398>: mov %r15d,%edi
>>> 0x00000000008d53a1 <+401>: callq 0x4cab30 <m5_fesetround>
>>>
>>>
>>> This is, more or less, from the following code.
>>>
>>>
>>> __asm__ __volatile__ ("" ::: "memory");
>>> int oldrnd = m5_fegetround();
>>> __asm__ __volatile__ ("" ::: "memory");
>>> m5_fesetround(newrnd);
>>> __asm__ __volatile__ ("" ::: "memory");
>>> Frds = Frs1s + Frs2s;
>>> __asm__ __volatile__ ("" ::: "memory");
>>> m5_fesetround(oldrnd);
>>> __asm__ __volatile__ ("" ::: "memory");
>>>
>>>
>>> Note that the addition was moved out of the middle and fesetround was
>>> called twice back to back, once to set the new rounding mode, and once
>>> to set it right back again.
>>>
>>> Gabe
>>>
>>> On 10/28/11 08:31, Ali Saidi wrote:
>>>> I'm still not 100% convinced that this is it. I agree it's highly
>>>> likely, but it could be some other code movement or a bug in the
>>>> optimizer (we have seen them before). I wonder if you can selectively
>>>> optimize functions. Maybe a good start is compiling everything -O3
>>>> except the atomic execute function and make sure it still works.
>>>>
>>>> Ali
>>>>
>>>>
>>>>
>>>> On Fri, 28 Oct 2011 07:38:59 -0700, Steve Reinhardt <***@gmail.com>
>>>> wrote:
>>>>> Yes, I think there exists at least one software IEEE FP
>>>>> implementation out
>>>>> there that we had talked about incorporating at some point (long ago).
>>>>> Unfortunately, as is discussed below, that's not even the issue, as we
>>>>> really want to model the not-quite-IEEE (or in the case of x87,
>>>>> not-even-close) semantics of the hardware alone, which would require
>>>>> more
>>>>> effort.
>>>>>
>>>>> If someone really cared about modeling the ISA FP support precisely that
>>>>> would be an interesting project, and if it was done cleanly (probably
>>>>> with
>>>>> the option to turn it on or off) we'd be glad to incorporate it.
>>>>>
>>>>> Ironically I think the issue here is not that the HW FP is not good
>>>>> enough
>>>>> for our purposes, it's that the software stack doesn't give us clean
>>>>> enough
>>>>> access to the HW facilities (gcc in particular, though C itself may
>>>>> share
>>>>> part of the blame).
>>>>>
>>>>> Steve
>>>>>
>>>>> On Thu, Oct 27, 2011 at 11:36 PM, Gabe Black <***@eecs.umich.edu>
>>>>> wrote:
>>>>>
>>>>>> I think there was talk of an FP emulation library a long time ago
>>>>>> (before I was involved with M5) but we decided not to do something like
>>>>>> that for some reason. Using regular built in FP support gets us most of
>>>>>> the way with minimal hassle, but then there are situations like this
>>>>>> where it really causes trouble. I presume the prior discussion might
>>>>>> have been about whether getting most of the way there was good enough,
>>>>>> and that it's simpler.
>>>>>>
>>>>>> Gabe
>>>>>>
>>>>>> On 10/27/11 07:43, Radivoje Vasiljevic wrote:
>>>>>>> ----- Original Message ----- From: "Gabe Black" <***@eecs.umich.edu>
>>>>>>> To: <gem5-***@gem5.org>
>>>>>>> Sent: 25 October 2011 20:53
>>>>>>> Subject: Re: [gem5-dev] Failed SPARC test
>>>>>>>
>>>>>>>
>>>>>>>> On 10/25/11 07:46, Steve Reinhardt wrote:
>>>>>>>>> On Tue, Oct 25, 2011 at 2:30 AM, Gabe Black <***@eecs.umich.edu>
>>>>>>>>> wrote:
>>>>>>> [snip]
>>>>>>>> Yeah, I think ISAs treat IEEE as a really good suggestion rather
>>>>>>>> than a standard. ARM isn't strictly conformant, and neither is x86.
>>>>>>>> The default rounding mode *is* standard, though, and I don't think
>>>>>>>> it's adjusted in SPARC as a result of execution. If it changed
>>>>>>>> somehow (unless I'm forgetting where SPARC does that) it's a fairly
>>>>>>>> significant problem. Whether instructions generate +/- 0 in various
>>>>>>>> situations may depend on, for instance, what order gcc decides to
>>>>>>>> put the operands. I'm not sure that it does, but there are all
>>>>>>>> kinds of weird, subtle behaviors with FP, and you can't just fix
>>>>>>>> how add works if x86 picked the wrong thing. Then you have to
>>>>>>>> replace add, or semi-replace it by faking it out with other FP
>>>>>>>> operations. If we're running real x87 instructions (we shouldn't be
>>>>>>>> in 64 bit mode, but we still could) then those use 80 bit operands
>>>>>>>> internally. Where and when rounding takes place depends on when
>>>>>>>> those are moved in/out of the FPU, and will be different than true
>>>>>>>> 64 bit operands. SSE based FP uses real 64 bit doubles, so that
>>>>>>>> should behave better. It should also be the default in 64 bit mode
>>>>>>>> since the compiler can assume some basic SSE support is present.
>>>>>>>>
>>>>>>> What about FP emulation using integers and some kind of multiple
>>>>>>> precision arithmetic? Then every detail could be modeled, including
>>>>>>> x87 "floats" and "doubles" (in registers the exponent field is still
>>>>>>> 15 bits, not 8/11, which makes a mess of overflow/underflow; once
>>>>>>> the value goes to memory it becomes a proper float/double). Gcc has
>>>>>>> some switches regarding that behavior, but they are very fragile
>>>>>>> (more of a suggestion to the compiler than an enforced option).
>>>>>>> Double rounding in x87 is a special story because the
>>>>>>> double-extended mantissa is not more than twice as long as the one
>>>>>>> for double, so double rounding can give different results compared
>>>>>>> to a single rounding (that situation can't happen with float vs
>>>>>>> double). One solution, for example: split the mantissas into two
>>>>>>> halves and perform the operation on the halves; all the bits would
>>>>>>> be available, and then any kind of rounding could be enforced (real
>>>>>>> IEEE or "ISA style IEEE"). Performing those operations is not very
>>>>>>> slow, and it is fairly ILP rich, so the slowdown is not as great as
>>>>>>> a pure instruction count would suggest (although to have robust
>>>>>>> code that is CPU and compiler independent, especially under
>>>>>>> "optimizing" compilers, some tests are needed to weed out subnormals
>>>>>>> due to poor support/trap emulation). Plus, if the instructions are
>>>>>>> mixed in the right way, both the int and FPU units can be kept busy.
>>>>>>> The exponent can fit in a short and the problem is solved. Only
>>>>>>> division is somewhat tricky (and slow), but it can be done too.
>>>>>>>
>>>>>>>
>>>>>>>>> Even if the FP rounding error isn't the source of the problem, it
>>>>>>>>> might be easiest to fix that and get it out of the way so we can
>>>>>>>>> see what the actual problem is.
>>>>>>>>>
>>>>>>>>> If you really want to know *why* the kernel is doing all this FP,
>>>>>>>>> then yes, you probably need to look at the source code.
>>>>>>>>>
>>>>>>>>> Steve
Gabe Black
2011-10-29 21:59:37 UTC
Permalink
http://permalink.gmane.org/gmane.comp.gcc.help/38146

On 10/29/11 14:21, Gabe Black wrote:
> [snip]
Ali Saidi
2011-10-29 23:30:35 UTC
Permalink
What about making m5_fesetround and m5_fegetround() modify memory and thus prevent reordering?

Something like:

// Volatile, so the compiler can't elide or reorder the accesses to it.
volatile int dummy_compiler;

void m5_fesetround(int rm)
{
    assert(rm >= 0 && rm < 4);
    dummy_compiler++;
    fesetround(m5_round_ops[rm]);
    dummy_compiler++;
}

int m5_fegetround()
{
    int x;
    dummy_compiler++;
    int rm = fegetround();
    dummy_compiler++;
    // Map the host rounding mode back to our 0-3 encoding.
    for (x = 0; x < 4; x++)
        if (m5_round_ops[x] == rm)
            return x;
    abort();
    return 0;
}

Would that just fix it? Maybe m5_round_ops and rm could be made volatile instead?

Another possible solution and hack, but I think we're into hack territory no matter what since gcc seems brain damaged in this regard:

#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 4) // 4.4 or newer
#pragma GCC push_options
#pragma GCC optimize ("O0")

// m5_fe* goes here

#pragma GCC pop_options
#endif


A third option would be something like

void __attribute__((optimize("O0"))) m5_fesetround(int rm)...

Ali


On Oct 29, 2011, at 4:59 PM, Gabe Black wrote:

> [snip]
Ali Saidi
2011-10-29 23:31:58 UTC
Permalink
If we go down the path below, slightly less hacky might be just making gcc compile the entire fenv file without optimization, although perhaps that is insufficient....

Ali

On Oct 29, 2011, at 6:30 PM, Ali Saidi wrote:

> [snip]
Gabe Black
2011-10-29 23:51:41 UTC
Permalink
I don't think either will work, because the problem isn't the
optimization of those functions, or their order relative to each other
or to the asms; it's the position of the add relative to the asms.
Since the add can move around wherever it likes, it doesn't matter that
the calls to fesetround are bounded by the asms. We could potentially
mark the execute function with a different optimization level, though.
That might work. Also, I have that filterDoubles function in there that
finds FP operands that are doubles and builds them from, or breaks them
down into, single floats. We could possibly piggyback on that to add in
asms with the right properties, like in ARM. It's a bit gross, but like
you said, I don't know if we can avoid that.
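
For reference, a sketch of what that piggybacking might look like
(anchorFloat is illustrative, not the existing filterDoubles code, and
the "x" constraint assumes an x86-64 host, where it means "any SSE
register"):

// Empty asm with an input/output operand: the value has to exist
// before the asm, and later uses have to consume the asm's "result",
// so it can't drift across the rounding mode calls.
static inline void
anchorFloat(float &val)
{
    __asm__ __volatile__ ("" : "+x" (val) : : "memory");
}

// Applied to the Fadds sequence:
int oldrnd = m5_fegetround();
m5_fesetround(newrnd);
anchorFloat(Frs1s);          // operands pinned after the mode switch
anchorFloat(Frs2s);
Frds = Frs1s + Frs2s;
anchorFloat(Frds);           // result pinned before the mode is restored
m5_fesetround(oldrnd);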

Gabe

On 10/29/11 16:31, Ali Saidi wrote:
> [snip]
Steve Reinhardt
2011-10-30 00:36:01 UTC
Permalink
Bleah, this is ugly! Reading that one bug report Gabe linked to, it sounds
like -frounding-math is supposed to make this work, but it's not correctly
implemented, and as a result there's really no straightforward way to make
this work. I think that should be documented somewhere so that one day, if
-frounding-math does get implemented properly, we can start relying on it
and not on whatever hack we come up with.

Another idea, assuming m5_fesetround() isn't inlined, would be to have it
accept a double argument that it just passes back unmodified. Then you
could do something like:

Frs1s = m5_fesetround(newrnd, Frs1s);
Frds = Frs1s + Frs2s;
Frds = m5_fesetround(oldrnd, Frds);

Would that work?
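To make that concrete, a rough sketch of the pass-through variant, assuming the helper lives in its own separately compiled file so it can't be inlined; the m5_round_ops table is borrowed from Ali's earlier snippet, and the mode mapping here is only illustrative:

// Hypothetical fenv helper in its own .cc file, never inlined (and no LTO),
// so the optimizer can't see that the value comes back unchanged.
#include <fenv.h>

// Assumed mapping from SPARC FSR.RD rounding modes to host fenv modes.
static const int m5_round_ops[4] = {
    FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD
};

// Set the host rounding mode and hand the operand back untouched. Because
// the compiler can't prove the return value equals the argument, the
// surrounding FP arithmetic has to stay ordered against these calls.
double
m5_fesetround(int rm, double val)
{
    fesetround(m5_round_ops[rm]);
    return val;
}

With that, Frs1s = m5_fesetround(newrnd, Frs1s); compiles to a real call whose result feeds the add, so the add can't be hoisted above the rounding-mode change.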

Steve

On Sat, Oct 29, 2011 at 4:51 PM, Gabe Black <***@eecs.umich.edu> wrote:

> I don't think either will work because it's not the optimizations in
> those functions or the functions' order relative to each other or the
> asms, it's the position of the add relative to the asms. Since the add
> can move around wherever, it doesn't matter if the calls to fesetround
> are bounded by the asms. We could potentially mark the execute function
> with a different optimization level though. That might work. Also, I
> have that filterDoubles function in there that finds fp operands that
> are doubles and builds them from or breaks them down into single floats.
> We could possibly piggyback on that to add in asms with the right
> properties like in ARM. It's a bit gross, but like you said I don't know
> if we can avoid that.
>
> Gabe
>
Gabriel Michael Black
2011-10-30 02:09:21 UTC
Permalink
That sounds a bit like what I did with the asm blocks in ARM, except
that it would modify that function and actually do that call. We'd
also need to have single and double versions even though the parameter
isn't used. This is the sort of thing I'm talking about from ARM.

m5_fesetround(newrnd);
__asm__ __volatile__ ("" : "=m" (Frs1s) : "m" (Frs1s));
__asm__ __volatile__ ("" : "=m" (Frs2s) : "m" (Frs2s));
Frds = Frs1s + Frs2s;
__asm__ __volatile__ ("" : "=m" (Frds) : "m" (Frds));
m5_fesetround(oldrnd);

Gcc is obligated to use the values of Frs1s and Frs2s "returned"
by the first two asm blocks, and it's obligated to pass the result as
a parameter to the third asm block. Those constraints pinch it in the
middle and force the operation to fall between the m5_fesetround calls.

This works in ARM, but here it's a little more cumbersome since SPARC
has been doing the rounding stuff in a generic way without knowledge
(more or less) of what the operands are. The filterDoubles thing is
already set up to look for operands, so it's not totally a new idea.

One thing I just thought of is that I'm not completely sure that gcc
will leave those variables in the same place when they're used for
inputs and outputs. Maybe it expects the asm to move the input to the
output even if it doesn't do anything else? Note that while they
*look* like the same variable, there's nothing (that I know of) that
requires gcc to make that name refer to the same storage all the time,
just the same value. There's syntax to tell it to use a particular
output as an input too (or the other way around?), and that may make
this sort of thing less of an issue. I have no good reason to think
there's actually a problem here, but hypothetically it could be yet
another problem with playing these sorts of games.
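The syntax in question is probably gcc's "+" read-write modifier, which
ties the input and the output to the same operand so there's no separate
output location for it to pick. A sketch of the barriers written that way,
same variables as above:

__asm__ __volatile__ ("" : "+m" (Frs1s));
__asm__ __volatile__ ("" : "+m" (Frs2s));
Frds = Frs1s + Frs2s;
__asm__ __volatile__ ("" : "+m" (Frds));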

Gabe

Gabriel Michael Black
2011-10-30 02:18:12 UTC
Permalink
Even this isn't foolproof, though. Theoretically there's nothing to
prevent gcc from doing the same calculation twice, once inside the asm
blobs and once outside. It may assume the values are the same and use
the wrong one as the result. The -frounding-math flag would prevent that,
I think, since it prevents gcc from assuming the rounding mode is
always the same. It would be weird for gcc to purposefully do the same
calculation twice, but it would be within its rights to do so.

Gabe

Steve Reinhardt
2011-10-31 14:57:53 UTC
Permalink
On Sat, Oct 29, 2011 at 7:09 PM, Gabriel Michael Black <
gblack-***@public.gmane.org> wrote:

> That sounds a bit like what I did with the asm blocks in ARM, except that
> it would modify that function and actually do that call. We'd also need to
> have single and double versions even though the parameter isn't used.


I'm not sure you would need two versions... it seems like if you defined
only a double version, but passed in a single-precision arg and assigned the result
to a float var, the value would get silently converted to double and back
without any side effects (since logically it would just append a bunch of
zeros to the mantissa and then strip them back off, and depending on the
host ISA, it might not do anything at all).
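Just to illustrate the round trip with a standalone snippet (nothing
gem5-specific about it):

#include <cassert>

int main()
{
    // Every float is exactly representable as a double, so widening and
    // narrowing again returns the original value: the widen appends zero
    // bits to the mantissa and the narrow strips them back off.
    float f = 0.1f;                      // not exact in binary; doesn't matter
    double d = f;                        // float -> double, always exact
    float back = static_cast<float>(d);  // double -> float, exact here
    assert(back == f);
    return 0;
}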


> This is the sort of thing I'm talking about from ARM.
>
> m5_fesetround(newrnd);
> __asm__ __volatile__ ("" : "=m" (Frs1s) : "m" (Frs1s));
> __asm__ __volatile__ ("" : "=m" (Frs2s) : "m" (Frs2s));
>
> Frds = Frs1s + Frs2s;
> __asm__ __volatile__ ("" : "=m" (Frds) : "m" (Frds));
> m5_fesetround(oldrnd);
>
> Gcc is obligated to use the values of Frs1s and Frs2s "returned" by
> the first two asm blocks, and it's obligated to pass the result as a
> parameter to the third asm block. Those constraints pinch it in the middle
> and force the operation to fall between the m5_fesetround calls.
>

It's a similar idea, but it seems to me that since it doesn't involve the
m5_fesetround call directly, and the compiler might be able to figure out
that there's no way m5_fesetround could legally get a pointer to the memory
locations holding the register values, it still wouldn't be obligated to
order the function calls with respect to the FP operation. In contrast, as
long as the compiler doesn't have any visibility into the body of the
m5_fesetround calls to realize they're not doing anything to the FP arg, it
will have no choice but to strictly order those calls wrt the FP operation.


> One thing I just thought of is that I'm not completely sure that gcc will
> leave those variables in the same place when they're used for inputs and
> outputs. Maybe it expects the asm to move the input to the output even if
> it doesn't do anything else? Note that while they *look* like the same
> variable, there's nothing (that I know of) that requires gcc to make that
> name refer to the same storage all the time, just the same value. There's
> syntax to tell it to use a particular output as an input too (or the other
> way around?), and that may make this sort of thing less of an issue. I have
> no good reason to think there's actually a problem here, but hypothetically
> it could be yet another problem with playing these sorts of games.
>

Yeah, I can see where the asm statement might need to be a "mov" from the
source to the dest to cover the case where they're not the same location.

Of course, if this asm approach fundamentally doesn't solve the problem,
then the question is moot.

On Sat, Oct 29, 2011 at 7:18 PM, Gabriel Michael Black <
gblack-***@public.gmane.org> wrote:

> Even this isn't foolproof, though. Theoretically there's nothing to
> prevent gcc from doing the same calculation twice, once inside the asm
> blobs and once outside. It may assume the values are the same and use the
> wrong one as the result.


I don't understand what you're saying here, but if we're just agreeing that
the asm thing probably isn't sufficient, there's no need to explain further
:-).


> The -frounding-math flag would prevent that, I think, since it prevents gcc
> from assuming the rounding mode is always the same.


My interpretation is that -frounding-math would solve all our problems if
only it were implemented correctly. Given that it's not implemented
correctly, can we count on it to do anything for us? Or am I
misinterpreting that bugzilla link you sent?

Steve
nathan binkert
2011-10-31 16:10:47 UTC
Permalink
Not that you guys need more options, but if you put the parameters to
the add in a volatile variable, that should prevent movement, right?

You could also just create a function for doing the add and that
shouldn't get reordered either.
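A rough sketch of that second option, assuming the helper sits in its own
translation unit and never gets inlined (or LTO'd), so the call itself
becomes the ordering point; the names here are made up:

// fp_ops.cc, compiled separately and kept opaque to the optimizer.
float
opaque_fadd(float a, float b)
{
    // With SSE math (the x86-64 default) this is a genuine
    // single-precision add.
    return a + b;
}

// At the call site in the instruction's execute code:
//   m5_fesetround(newrnd);
//   Frds = opaque_fadd(Frs1s, Frs2s);
//   m5_fesetround(oldrnd);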

Nate
