[gem5-dev] Parallel version of GEM5

Anirudh Sivaraman

2012-03-29 16:47:47 UTC

I have a design for a parallel version of GEM5. I wanted to run it by
the dev list before jumping in. The idea is to simulate a networked
system of multiple machines. The networking simulation will be handled
by ns-3, a standard networking simulator. Each GEM5 instance will
connect into ns-3 using a tap device (I hope to use ethertap.cc for
this) and ns-3 will act as a "router" forwarding packets between GEM5
instances. Each machine will be simulated by its own GEM5 instance in
a separate thread and will hook into ns-3 using a tap device (ns-3 has
some support for this). ns-3 is pretty flexible and can simulate
wired/wireless networks, but that should hopefully not matter to GEM5.

The natural question is handling synchronization between the simulated
times in the various GEM5 instances. My idea is to use barrier
synchronization between the various GEM5 instances at periodic time
intervals. Let's assume this time interval is 10 ms. Then each GEM5
instance runs from 0 through 10 ms of simulated time, and then waits
until all other GEM5 instances have finished their 10 ms slice as
well. The process then repeats itself from simulated time 10 to 20 ms.
Consequently, instances don't get out of sync by more than 10 ms at
any point. This interval is tunable: a lower interval gives you more
accuracy but a longer run time as well.
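
A minimal sketch of this barrier scheme, assuming Python threads stand in
for the GEM5 instances and a counter stands in for real simulation work.
The names `run_quanta` and `QUANTUM_MS` and the skew bookkeeping are
hypothetical and not part of gem5 or ns-3:

```python
import threading

QUANTUM_MS = 10  # sync interval from the example above (an assumed value)


def run_quanta(num_instances, num_quanta, quantum_ms=QUANTUM_MS):
    """Advance stand-in instances one quantum at a time behind a barrier.

    Returns (final_clocks, max_skew_ms), where max_skew_ms is the largest
    difference between any two instance clocks observed during the run.
    """
    barrier = threading.Barrier(num_instances)
    lock = threading.Lock()
    clocks = [0] * num_instances  # simulated time of each instance, in ms
    max_skew = [0]

    def instance(idx):
        for _ in range(num_quanta):
            # Stand-in for simulating one slice of quantum_ms.
            with lock:
                clocks[idx] += quantum_ms
                max_skew[0] = max(max_skew[0], max(clocks) - min(clocks))
            # Wait here until every instance has finished the same slice.
            barrier.wait()

    threads = [threading.Thread(target=instance, args=(i,))
               for i in range(num_instances)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, max_skew[0]
```

Calling `run_quanta(4, 5)` leaves every clock at the same simulated time,
and the recorded skew never exceeds one quantum, because no instance can
start a new slice before all others have finished the previous one.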

I realize that determinism is impossible in this framework, but that's
a hit I am willing to take for my work. I wanted to know if there were
any code examples on using ethertap.cc just like etherlink.cc
(twosys-tsunami-simple-atomic.py)

Anirudh

nathan binkert

2012-03-29 16:57:43 UTC

Post by Anirudh Sivaraman
I realize that determinism is impossible in this framework, but that's
a hit I am willing to take for my work. I wanted to know if there were
any code examples on using ethertap.cc just like etherlink.cc
(twosys-tsunami-simple-atomic.py)

Hi Anirudh,

I'm the one that wrote the tap interface (something like 8 or 10 years
ago) and I'm pretty certain that it hasn't been used in about that
long. I'm guessing that it doesn't use the right raw interface
mechanism for example. That said, getting tap itself to work should
be pretty easy. I have a few comments/concerns.

1) While you may not be concerned about determinism, you need to make
sure that your clocks don't get wildly off. If they do, TCP timeouts
start getting triggered and bad things happen. It may be surprising,
but one thing that happens when doing networking stuff is that the
simulator can go too *fast*. This is because we simply skip ahead
when the system is idle and the other system may not skip ahead. Gabe
Black (I'm pretty sure it was him) wrote an event a while back that
would ensure that the simulator never got too far ahead of wall clock
time. You may need to use this.

2) I personally wouldn't use the tap interface. I'd probably start
with it, but I'd convert it to use a named pipe or something like that
so I could annotate the packets with timing information. If you use
the real tap interface and involve the Linux kernel, I'd just worry
that it would introduce timing issues and complexity that you just
don't want to deal with.

3) Steve Reinhardt did a lot of work around making the simulator
multithreaded. I don't think that work ever made it into the tree,
but there were patches on reviewboard. We should probably revive that
and make sure that you take advantage of whatever he has. You can use
this stuff to help with the barriers.
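
Point 2 could be sketched roughly as follows: a minimal framing format
that prefixes each Ethernet frame with the sender's simulated tick, so the
receiving side can schedule delivery in simulated time. The header layout
and the names `write_frame` and `read_frame` are assumptions made for
illustration, not an existing gem5 or ns-3 interface:

```python
import io
import struct

# Hypothetical header: 8-byte send tick, 4-byte payload length (network order).
HEADER = struct.Struct("!QI")


def write_frame(stream, send_tick, payload):
    """Write one timing-annotated frame to a binary stream."""
    stream.write(HEADER.pack(send_tick, len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Read one frame; return (send_tick, payload), or None at end of stream.

    Assumes a buffered binary stream whose read(n) returns n bytes unless
    the stream has ended (true for io.BytesIO and buffered file objects).
    """
    header = stream.read(HEADER.size)
    if len(header) < HEADER.size:
        return None
    send_tick, length = HEADER.unpack(header)
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("truncated frame")
    return send_tick, payload
```

The same two functions would work over a named pipe opened in binary mode;
an in-memory `io.BytesIO` buffer is enough to try them out.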

Good luck. This is something I always wanted to do, but never managed
to make happen. I hope you make it work and contribute any patches
back to the tree.

Thanks,

Nate

Rodrigues, Arun F

2012-03-29 17:28:44 UTC

Anirudh,

You might want to take a look at the SST project here at Sandia
(http://code.google.com/p/sst-simulator/). We've incorporated GeM5 into a
parallel discrete event framework and run it to a few hundred nodes. We
use a latency-based conservative optimization similar to what you describe
(i.e. fixed lookahead), but the lookahead is bounded by the minimum
latency between nodes, so it is deterministic. This is all layered over
MPI, and the performance seems reasonable (~80% scaling efficiency).
(Note, we went over MPI instead of threading for scalability reasons. We
eventually want to look at several thousands of nodes, and we don't have
easy access to a large enough multithreaded machine. Our early tests seem
to indicate that there is not a huge performance difference for our use
case.) Currently, we have a NIC based on the Portals API connected to a
model of the router used in the Cray XT3 series or routers from the
Georgia Tech IRIS simulator.
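
A minimal sketch of the fixed-lookahead idea, assuming link latencies are
known up front. The function names and the latency table are hypothetical,
not SST code. Because every message takes at least the minimum latency to
arrive, nothing sent during a window can land inside that same window,
which is what makes the window boundary a safe place to exchange messages:

```python
def lookahead_ns(link_latency_ns):
    """Safe sync window: the smallest latency between any two nodes."""
    if not link_latency_ns:
        raise ValueError("need at least one link")
    return min(link_latency_ns.values())


def earliest_arrival_ns(window_start_ns, link_latency_ns):
    """Earliest time a message sent in the window starting here can arrive."""
    return window_start_ns + lookahead_ns(link_latency_ns)


# Hypothetical topology: (source, destination) -> one-way latency in ns.
LINKS = {("a", "b"): 500, ("b", "c"): 200, ("a", "c"): 800}
```

With these made-up numbers the window is 200 ns, so a window that starts
at 1000 ns cannot receive anything sent during it before 1200 ns.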

The current version of SST hangs a network interface off the memory bus
(see attached picture), but we are working on a more generic version which
would allow you to run each CPU on a different MPI rank to improve socket
simulation speeds.

Thanks,

arun

_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev

nathan binkert

2012-03-29 17:43:15 UTC

Doh! I totally forgot about SST. Any patches that you can get back into gem5?

Nate

Anirudh Sivaraman

2012-03-29 17:49:36 UTC

Thank you very much for the responses! I had no idea about SST, but it
definitely looks like a great project for me to start looking for
ideas. Regarding the TCP timeouts, I had forgotten that gem5 skips over
idle cycles, so in effect I either need to make the time slice for
syncing much lower than the TCP timeout or I need to disable this
'cycle-skipping'.

While on the topic of SST, do you have any calibration studies with
respect to other, potentially more accurate simulators or real hardware?
I would be very interested in looking at any such studies.

Anirudh

nathan binkert

2012-03-29 17:53:55 UTC

Post by Anirudh Sivaraman
Thank you very much for the responses! I had no idea about SST, but it
definitely looks like a great project for me to start looking for
ideas. Regarding the TCP timeouts, I had forgotten that gem5 skips over
idle cycles, so in effect I either need to make the time slice for
syncing much lower than the TCP timeout or I need to disable this
'cycle-skipping'.

As I mentioned, there's an event that you can use to help you with
this. It basically bounds how fast it can skip.

Nate

Anirudh Sivaraman

2012-03-29 17:57:11 UTC

Post by nathan binkert

As I mentioned, there's an event that you can use to help you with
this. It basically bounds how fast it can skip.

Is this already in the code base? If so, what is it called, or roughly
where do I need to search for it in the code base?

Anirudh

nathan binkert

2012-03-29 18:02:16 UTC

Post by Anirudh Sivaraman
Is this already in the code base? If so, what is it called, or roughly
where do I need to search for it in the code base?

I honestly don't recall. Gabe? Steve? Am I crazy?

Nate

nathan binkert

2012-03-29 20:50:11 UTC

Is what already in the code base? Multithreaded simulation? Not that I'm
aware of.

No. The thing that prevents the simulator clock from going too fast
with respect to real time.

Nate

Ali Saidi

2012-03-29 22:06:47 UTC

It's the --timesync option on the command line.
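
The idea behind such an option can be sketched as follows. This is only an
illustration of keeping simulated time from outrunning wall-clock time,
not gem5's actual implementation; the class and function names are
hypothetical:

```python
import time


def throttle_delay(sim_elapsed_s, wall_elapsed_s):
    """Seconds to sleep so simulated time does not outrun wall-clock time."""
    return max(0.0, sim_elapsed_s - wall_elapsed_s)


class WallClockThrottle:
    """Call sync() periodically with the simulated seconds elapsed so far."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        # clock and sleep are injectable so the logic can be tested
        # without really waiting.
        self._clock = clock
        self._sleep = sleep
        self._start = clock()

    def sync(self, sim_elapsed_s):
        wall_elapsed_s = self._clock() - self._start
        delay = throttle_delay(sim_elapsed_s, wall_elapsed_s)
        if delay > 0:
            # The simulator is ahead of real time: wait for it to catch up.
            self._sleep(delay)
        return delay
```

If the simulator is slower than real time, `sync()` returns 0.0 and never
sleeps; it only holds the simulator back when idle skipping has pushed
simulated time ahead of the wall clock.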

Ali

Rodrigues, Arun F

2012-03-29 19:12:08 UTC

Post by Anirudh Sivaraman
Thank you very much for the responses! I had no idea about SST, but it
definitely looks like a great project for me to start looking for
ideas. Regarding the TCP timeouts, I had forgotten that gem5 skips over
idle cycles, so in effect I either need to make the time slice for
syncing much lower than the TCP timeout or I need to disable this
'cycle-skipping'.

What we did initially was to disable the cycle-skipping, but this led to
some performance degradation. So instead, we just make sure that GeM5 is
called at least once per lookahead period, i.e. we can skip up to
<lookahead> number of cycles. This improved performance by about 20%.
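
A minimal sketch of idle skipping bounded by the lookahead, as described
above. The names `next_wakeup` and `invocation_ticks` are hypothetical and
the event queue is reduced to a list of ticks; this is not SST or gem5
code:

```python
def next_wakeup(now, next_event, lookahead):
    """Tick to jump to: the next event, but at most `lookahead` ticks ahead."""
    limit = now + lookahead
    if next_event is None:
        return limit
    return min(next_event, limit)


def invocation_ticks(event_ticks, end_tick, lookahead):
    """Return the ticks at which the simulator would be invoked."""
    pending = sorted(event_ticks)
    now = 0
    visited = []
    while now < end_tick:
        upcoming = pending[0] if pending else None
        now = min(next_wakeup(now, upcoming, lookahead), end_tick)
        visited.append(now)
        # Drop every event that has been reached.
        while pending and pending[0] <= now:
            pending.pop(0)
    return visited
```

With events at ticks 250 and 2600, an end tick of 3000 and a lookahead of
1000, the simulator is invoked only five times instead of once per cycle,
yet it never goes more than one lookahead period without being called.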

If you run into any issues with GeM5/SST, there is an SST mailing list
which might be useful for the more SST-related parts. (Our documentation
is... not great...)

Post by Anirudh Sivaraman
While on the topic of SST, do you have any calibration studies with
respect to other, potentially more accurate simulators or real hardware?
I would be very interested in looking at any such studies.

We are currently doing some of that now, mainly for x86 architectures,
comparing against real hardware. Actually, we've been running into some
issues with the cache accuracy and prefetchers. I'll see what exact
numbers I can scrounge up.

Anirudh Sivaraman

2012-03-29 19:30:43 UTC

Thanks for the information, and do let me know if you have any numbers.
I'll keep the cycle-skipping overhead in mind and get back to you
if it turns out to be too much for me.

Anirudh

Rodrigues, Arun F

2012-03-29 22:09:52 UTC

There are definitely some patches we'd like to push upstream, but right now
they are a bit of a mess. Once we have the more componentized version a
little more stable we can clean it up.

Post by nathan binkert
Doh! I totally forgot about SST. Any patches that you can get back into gem5?
Nate

Erik Tomusk

2012-03-30 09:42:05 UTC

ARM recently announced an internship to do something similar:
http://www.hipeac.net/content/multi-thread-gem5-simulator

-Erik

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Anirudh Sivaraman

2012-03-30 14:06:20 UTC

Post by Erik Tomusk
http://www.hipeac.net/content/multi-thread-gem5-simulator

That's interesting. Does ARM have anybody or any group working on
something similar? Coincidentally, my research focus is on
simulating networked Android phones. Since most phones today run ARM,
I would mostly be interested in simulating an ARM processor within
each GEM5 instance.

Anirudh

Erik Tomusk

2012-03-30 15:13:58 UTC

All I know is what it says in the link; you'd have to ask someone from
ARM (maybe someone here on the mailing list knows more). The deadline
for that internship was yesterday, so I'd guess things are not concrete yet.

-Erik
