Discussion:
[gem5-dev] Failing regression pc-simple-timing
(too old to reply)
Andreas Hansson
2012-08-06 14:17:04 UTC
Permalink
Hi everyone,

A bit more detail on the mysterious pc-simple-timing that passes in zizzer but fails on a whole range of other systems. It turns out the zizzer stats generated in the test run are wrong(!)

Here's how I found this out:


1) Added a flag where the stat is incremented (num_mem_refs and store_insts in src/cpu/simple/base.cc)

2) Dumped on two systems with the stats .value() printed with a dedicated flag

3) Compared the traces, and they match! This is where my head started to spin... :)

4) Ran the pc-simple-timing manually on zizzer with the command found in the simout and guess what: the stats are the same as on the other systems!

5) Tried running the regression again with scons -j1 to avoid any odd race conditions, but the stats that come out that way are off, suggesting the last instruction (or at least memory access) has simply not happened yet the moment the stats are dumped. Odd to say the least.

I have massaged the tests/SConscript to not use scons Execute command, but instead subprocess.call, and with this change the regressions on zizzer also fail.

In conclusion, either scons or gem5 is doing something odd here. Any scons ninja out there that has any idea? I tried adding a sleep of 10 s before and after, but that does not seem to help. Could it be related to some bug in drain?

Perhaps we can bump the scons version on zizzer just in case?

Andreas

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Nilay
2012-08-06 15:48:27 UTC
Permalink
Post by Andreas Hansson
Hi everyone,
A bit more detail on the mysterious pc-simple-timing that passes in zizzer
but fails on a whole range of other systems. It turns out the zizzer stats
generated in the test run are wrong(!)
1) Added a flag where the stat is incremented (num_mem_refs and
store_insts in src/cpu/simple/base.cc)
2) Dumped on two systems with the stats .value() printed with a dedicated flag
3) Compared the traces, and they match! This is where my head started to spin... :)
4) Ran the pc-simple-timing manually on zizzer with the command found
in the simout and guess what: the stats are the same as on the other
systems!
5) Tried running the regression again with scons -j1 to avoid any odd
race conditions, but the stats that come out that way are off, suggesting
the last instruction (or at least memory access) has simply not happened
yet the moment the stats are dumped. Odd to say the least.
I have massaged the tests/SConscript to not use scons Execute command, but
instead subprocess.call, and with this change the regressions on zizzer
also fail.
In conclusion, either scons or gem5 is doing something odd here. Any scons
ninja out there that has any idea? I tried adding a sleep of 10 s before
and after, but that does not seem to help. Could it be related to some bug
in drain?
Perhaps we can bump the scons version on zizzer just in case?
I was recently testing a patch related to x86 when I also noticed the
problem and the stats actually matched the ones that you had posted
earlier. I think it is some case similar to x86 cpuid instruction. Steve
had mentioned another instruction for which he saw difference in traces
from two different runs. I tried to figure if there was something wrong,
but was not able to.

I think if there was some problem with scons / dumping of stats, we would
have seen this problem occur with other tests as well.

--
Nilay
Ali Saidi
2012-08-07 16:03:35 UTC
Permalink
On Mon, August 6, 2012 9:17
Hi everyone, A bit more detail on the
mysterious pc-simple-timing that passes in zizzer but fails on a whole
range of other systems. It turns out the zizzer stats generated in the
test run are wrong(!) Here's how I found this out: 1) Added a flag where
the stat is incremented (num_mem_refs and store_insts in
src/cpu/simple/base.cc) 2) Dumped on two systems with the stats .value()
printed with a dedicated flag 3) Compared the traces, and they match!
This is where my head started to spin... :) 4) Ran the pc-simple-timing
manually on zizzer with the command found in the simout and guess what:
the stats are the same as on the other systems! 5) Tried running the
regression again with scons -j1 to avoid any odd race conditions, but
the stats that come out that way are off, suggesting the last
instruction (or at least memory access) has simply not happened yet the
moment the stats are dumped. Odd to say the least. I have massaged the
tests/SConscript to not use scons Execute command, but instead
subprocess.call, and with this change the regressions on zizzer also
fail. In conclusion, either scons or gem5 is doing something odd here.
Any scons ninja out there that has any idea? I tried adding a sleep of
10 s before and after, but that does not seem to help. Could it be
related to some bug in drain? Perhaps we can bump the scons version on
zizzer just in case?
I was recently testing a patch related to x86
when I also noticed the
problem and the stats actually matched the
ones that you had posted
earlier. I think it is some case similar to
x86 cpuid instruction. Steve
had mentioned another instruction for
which he saw difference in traces
from two different runs. I tried to
figure if there was something wrong,
but was not able to.
I think
if there was some problem with scons / dumping of stats, we would
have
seen this problem occur with other tests as well.

The difference
between running gem5 in SCons and running it with a subprocess call is
the environment that is present when gem5 starts (what is on the initial
stack mostly). It's likely that the stack differences is causing the
output to change and it's not SCons fault directly.

Thanks,

Ali
Steve Reinhardt
2012-08-07 16:08:42 UTC
Permalink
Post by Ali Saidi
The difference
between running gem5 in SCons and running it with a subprocess call is
the environment that is present when gem5 starts (what is on the initial
stack mostly). It's likely that the stack differences is causing the
output to change and it's not SCons fault directly.
That's my primary (only?) hypothesis, too, but it still seems very weird
that the "alternate" results only show up in this one very specific case
(one invocation method on one machine). It seems like if it really was a
stack size/alignment issue that we'd be seeing inconsistent results on more
than one machine.

Steve
nathan binkert
2012-08-07 16:15:55 UTC
Permalink
Post by Steve Reinhardt
That's my primary (only?) hypothesis, too, but it still seems very weird
that the "alternate" results only show up in this one very specific case
(one invocation method on one machine). It seems like if it really was a
stack size/alignment issue that we'd be seeing inconsistent results on more
than one machine.
It could be a page alignment thing. I bet most stacks start on the
first page (env + command line is usually not 4k i'd wager), but with
this particular case, it could be on the second page. Totally
speculative, but weirder things have been seen.

What seems odd is that this is in FS mode. I wouldn't expect any of
that code to be affected by the environment. Perhaps time for a code
audit to see how we are using the environment. There are a couple of
calls to getenv to get TZ on the C++ side of things.

Nate
Steve Reinhardt
2012-08-08 05:44:02 UTC
Permalink
Post by nathan binkert
Post by Steve Reinhardt
That's my primary (only?) hypothesis, too, but it still seems very weird
that the "alternate" results only show up in this one very specific case
(one invocation method on one machine). It seems like if it really was
a
Post by Steve Reinhardt
stack size/alignment issue that we'd be seeing inconsistent results on
more
Post by Steve Reinhardt
than one machine.
It could be a page alignment thing. I bet most stacks start on the
first page (env + command line is usually not 4k i'd wager), but with
this particular case, it could be on the second page. Totally
speculative, but weirder things have been seen.
What seems odd is that this is in FS mode. I wouldn't expect any of
that code to be affected by the environment. Perhaps time for a code
audit to see how we are using the environment. There are a couple of
calls to getenv to get TZ on the C++ side of things.
I added 'env -i' to the beginning of cmd in tests/SConscript, and the test
passes on zizzer now. So it seems conclusive that it's an environment
thing. That's not the solution though; somehow we need to figure out why
the environment has an impact. It would be nice if we could reproduce the
failure outside of scons, then we could try to use tracediff to track it
down.

Steve
nathan binkert
2012-08-08 05:45:51 UTC
Permalink
Post by Steve Reinhardt
I added 'env -i' to the beginning of cmd in tests/SConscript, and the test
passes on zizzer now. So it seems conclusive that it's an environment
thing. That's not the solution though; somehow we need to figure out why
the environment has an impact. It would be nice if we could reproduce the
failure outside of scons, then we could try to use tracediff to track it
down.
I'd suggest creating a crazy large environment and seeing if you can
get it to happen elsewhere. Create variables A, B, etc. and fill them
each with a few hundred characters.

NateNateNateNateNateNateNateNateNateNateNateNate.....
Nilay Vaish
2012-08-09 16:32:37 UTC
Permalink
Post by Steve Reinhardt
I added 'env -i' to the beginning of cmd in tests/SConscript, and the test
passes on zizzer now. So it seems conclusive that it's an environment
thing. That's not the solution though; somehow we need to figure out why
the environment has an impact. It would be nice if we could reproduce the
failure outside of scons, then we could try to use tracediff to track it
down.
The test was already passing on zizzer. Or is it that you tried that
subprocess.call() thing that Andreas was doing, and the test still passes
when 'env -i' is executed before the test?

--
Nilay

nathan binkert
2012-08-07 16:12:32 UTC
Permalink
Post by Ali Saidi
The difference
between running gem5 in SCons and running it with a subprocess call is
the environment that is present when gem5 starts (what is on the initial
stack mostly). It's likely that the stack differences is causing the
output to change and it's not SCons fault directly.
We had similar problems with ASIM many moons ago. It seems to me that
we should really have a wrapper script for sanitizing the stack.
Something as simple as:

#!/bin/sh
exec /usr/bin/env -i $*


Unfortunately, that's not perfect since we do have a few environment
variables that we use and we also pass arguments on the command line.
Both of which are not fixed length. That said, the SCons environment
is massive compared to most user environments. Using a wrapper
script, we could prune both massively.

Nate
Continue reading on narkive:
Loading...