Discussion:
[gem5-dev] pd-gem5: simulating a parallel/distributed system on multiple physical hosts
Mohammad Alian
2015-06-24 04:05:07 UTC
Hello All,

I have submitted a chain of patches which enables gem5 to simulate a
cluster on multiple physical hosts:

http://reviews.gem5.org/r/2909/
http://reviews.gem5.org/r/2910/
http://reviews.gem5.org/r/2912/
http://reviews.gem5.org/r/2913/
http://reviews.gem5.org/r/2914/

and a patch that contains run scripts for a simple experiment:
http://reviews.gem5.org/r/2915/

We have run several benchmarks using this infrastructure, including NAS
parallel benchmarks (MPI) and DCBench-hadoop (http://prof.ict.ac.cn/DCBench/),
and would be happy to share scripts/diskimages.

We call this *pd-gem5*. The functionality of *pd-gem5* is more or less the
same as Curtis's patch for *multi-gem5*. However, I feel the *pd-gem5* network
model is more thorough; it also enables modeling different network topologies.
Having both sets of changes together lets reviewers pick the best features
from both works.

Thank you,
Mohammad Alian
Steve Reinhardt
2015-06-24 04:11:24 UTC
Thanks for posting, Mohammad! I will try to look your patches over later
this week.

Steve

Andreas Hansson
2015-06-24 07:26:02 UTC
Hi all,

Great work. However, I fundamentally do not believe in the approach of
‘letting reviewers pick the best features’. There is no way we would ever
get something working out of it. We need to get _one_ working solution
here, and figure out how to best get there. I would propose to do it
bottom up, starting with the basic multi-simulator instance support,
checkpointing support, and then move on to the network between the
simulator instances.

Thus, I propose we go with the low-level plumbing and checkpoint support
from what Curtis has posted. I believe proper checkpointing support to be
the most challenging, and from what I can tell this is far more limited in
what you just posted, Mohammad. Could you perhaps review Curtis's patches
based on your insights, so we can try to get these patches in shape and
committed asap?

Once we have the baseline functionality in place, then we can start
looking at the more elaborate network models.

Does this sound reasonable?

Thanks,

Andreas



-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782
Steve Reinhardt
2015-06-24 12:11:51 UTC
Hi Andreas,

I'm a little confused by your email---you say you're fundamentally opposed
to looking at both patches and picking the best features, then you point
out that the patches Curtis posted have the feature of better checkpointing
support so we should pick that :).

Obviously we can't just pick patch A from Mohammad's set and patch B from
Curtis's set and expect them to work together, but I think that having both
sets of patches available and comparing and contrasting the two
implementations should enable us to get to a single implementation that's
the best of both. Someone will have to make the effort of integrating the
better ideas from one set into the other set to create a new unified set of
patches (or maybe we commit one set and then integrate the best of the
other set as patches on top of that), but the first step is to identify
what "the best of both" is. Having Mohammad look at Curtis's patches, and
Curtis (or someone else from ARM) closely examine Mohammad's patches would
be a great start. I intend to review them both, though unfortunately my
time has been scarce lately---I'm hoping to squeeze that in later this week.

Once we've had a few people look at both, we can discuss the pros and cons
of each, then discuss the strategy for getting the best features in. So
far I've heard that Mohammad's patches have a better network model but the
ARM patches have better checkpointing support; that seems like a good start.

Steve

Andreas Hansson
2015-06-24 12:25:58 UTC
Hi Steve,

Apologies for the confusion. We are on the same page. My point is that we
cannot simply take a little bit of patch A and a little bit of patch B.
This change involves a lot of code, and we need to approach this in a
structured fashion. My proposal is to do it bottom up, and start by
getting the basic support in place. Since http://reviews.gem5.org/r/2826/
has already been on the review board for a few months, I am merely
suggesting that it would be a good start to relate the newly posted
patches to what is already there.

Andreas



Mohammad Alian
2015-06-24 19:43:06 UTC
Hi Andreas,

Thanks for the comment.
I think the checkpointing support in both works is the same. Here is how
checkpointing support is implemented in pd-gem5:

Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
instruction, it sends a “recv-ckpt” signal to the “barrier” process. The
“barrier” process then sends a “take-ckpt” signal to all the simulated nodes
(including the node that encountered m5-checkpoint) at the end of the
current simulation quantum. On receiving the “take-ckpt” signal, the gem5
processes start dumping checkpoints. This makes each simulated node dump a
checkpoint at the same simulated time point while ensuring there are no
in-flight packets.
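
The handshake described above can be sketched in a few lines of Python.
This is only a toy model, not code from the pd-gem5 patches: the names
Node, Barrier, simulate and QUANTUM are hypothetical, all "processes" run
in one Python process, and packets are not modeled at all. It only shows
that a request raised mid-quantum is acted on by every node at the same
quantum boundary.

```python
from dataclasses import dataclass, field
from typing import List, Optional

QUANTUM = 100  # simulated ticks per synchronization quantum (assumed value)


@dataclass
class Node:
    """Toy stand-in for one gem5 process simulating one node."""
    name: str
    # Tick at which this node hits the m5-checkpoint pseudo instruction.
    ckpt_request_tick: Optional[int] = None
    now: int = 0
    checkpoints: List[int] = field(default_factory=list)

    def run_quantum(self, barrier: "Barrier") -> None:
        end = self.now + QUANTUM
        hit = self.ckpt_request_tick
        if hit is not None and self.now <= hit < end:
            barrier.recv_ckpt(self.name)  # the "recv-ckpt" signal
        self.now = end  # advance to the quantum boundary

    def take_ckpt(self) -> None:
        # Handling "take-ckpt": record the simulated time of the dump.
        self.checkpoints.append(self.now)


class Barrier:
    """Toy stand-in for the "barrier" process."""

    def __init__(self, nodes: List[Node]) -> None:
        self.nodes = nodes
        self.pending = False

    def recv_ckpt(self, sender: str) -> None:
        self.pending = True  # remember the request until the quantum ends

    def end_of_quantum(self) -> None:
        if self.pending:
            for node in self.nodes:  # includes the requesting node
                node.take_ckpt()
            self.pending = False


def simulate(nodes: List[Node], quanta: int) -> None:
    barrier = Barrier(nodes)
    for _ in range(quanta):
        for node in nodes:
            node.run_quantum(barrier)
        barrier.end_of_quantum()


nodes = [Node("node0"), Node("node1", ckpt_request_tick=250), Node("node2")]
simulate(nodes, quanta=5)
for n in nodes:
    print(n.name, n.checkpoints)
# prints:
# node0 [300]
# node1 [300]
# node2 [300]
```

In this toy run node1 requests the checkpoint at tick 250, and all three
nodes record it at tick 300, the end of the quantum in which the request
was raised.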

I believe this is the same as the multi-gem5 patch's approach to checkpoint
support (based on the commit message of http://reviews.gem5.org/r/2865/).
Also, we have tested our mechanism with several benchmarks and it works. As
Steve suggested, I'll look into Curtis's patch and try to review it as
well. But as Nilay also mentioned earlier, some code is missing from
Curtis's patch. I would prefer to run multi-gem5 first before starting to
review it.

Thank you,
Mohammad

Gutierrez, Anthony
2015-06-26 01:36:42 UTC
Would it make sense for me to ship the EtherSwitch patch first, since it has utility on its own, and then we can decide which of the "multi-gem5" approaches is best, or if it's some combination of both?

The only reason I never shipped it was because Steve raised an issue that I didn't have a good alternative for, and didn't have the time to look into one at that time.
_______________________________________________
gem5-dev mailing list
gem5-***@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev
Mohammad Alian
2015-06-27 00:37:51 UTC
Permalink
Hi Anthony,

I think that would be a good option; I can then add the pd-gem5
functionality on top of it. Right now I've simplified your implementation.
Also, I think I found some bugs in your patch, though I cannot remember
them now. If you decide to ship the EtherSwitch patch, let me know so that
I can review it.

Thanks,
Mohammad

On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
***@amd.com> wrote:

> Would it make sense for me to ship the EtherSwitch patch first, since it
> has utility on its own, and then we can decide which of the "multi-gem5"
> approaches is best, or if it's some combination of both?
>
> The only reason I never shipped it was because Steve raised an issue that
> I didn't have a good alternative for, and didn't have the time to look into
> one at that time.
Curtis Dunham
2015-06-27 01:40:44 UTC
Permalink
Hello everyone,
We have taken a look at how pd-gem5 compares with multi-gem5. While intending
to deliver the same functionality, there are some crucial differences:

* Synchronization.

pd-gem5 implements this in Python (not a problem in itself; aesthetically
this is nice, but...). The issue is that pd-gem5's data packets and
barrier messages travel over different sockets. Since pd-gem5 could see
data packets passing synchronization barriers, it could create an
inconsistent checkpoint.

multi-gem5's synchronization is implemented in C++ using sync events, but
more importantly, the messages queue up in the same stream and so cannot
have the issue just described. (Event ordering is often crucial in
snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
solution in this respect.

* Packet handling.

pd-gem5 uses EtherTap for data packets but changed the polling mechanism
to go through the main event queue. Since this rate is actually linked
with simulator progress, it cannot guarantee that the packets are serviced
at regular intervals of real time. This can lead to packets queueing up
which would contribute to the synchronization issues mentioned above.

multi-gem5 uses plain sockets with separate receive threads and so does not
have this issue.

* Checkpoint accuracy.

A user would like to have a checkpoint at precisely the time the
'm5 checkpoint' operation is executed so as to not miss any of the
area of interest in his application.

pd-gem5 requires that simulation finish the current quantum
before checkpointing, so it cannot provide this.

(Shortening the quantum can help, but usually the snapshot is being taken
while 'fast-forwarding', i.e. simulating as fast as possible, which would
motivate a longer quantum.)

multi-gem5 can enter the drain cycle immediately upon receiving a
checkpoint request. We find this accuracy highly desirable.

* Implementation of network topology.

pd-gem5 uses a separate gem5 process to act as a switch whereas multi-gem5
uses a standalone packet relay process.

We haven't measured the overhead of pd-gem5's simulated switch yet, but
we're confident that our approach is at least as fast and more scalable.
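
The ordering concern raised under "Synchronization" above can be illustrated with a toy model. This is only a sketch under stated assumptions, not code from pd-gem5 or multi-gem5: each socket is modelled as a FIFO queue, the receiver polls its channels in a fixed order, and all function and message names are hypothetical.

```python
from collections import deque


def packets_seen_before_barrier(channels):
    """Drain channels in the given polling order until a barrier is read.

    Each channel is a FIFO of (kind, payload) messages. Returns the data
    payloads consumed before the barrier was observed.
    """
    seen = []
    for channel in channels:
        while channel:
            kind, payload = channel.popleft()
            if kind == "barrier":
                return seen
            seen.append(payload)
    return seen


def single_stream():
    # Data and barrier share one FIFO, so the barrier cannot overtake data.
    stream = deque([("data", "pkt0"), ("data", "pkt1"), ("barrier", None)])
    return packets_seen_before_barrier([stream])


def separate_sockets():
    # Data and barrier use different FIFOs; here the receiver happens to
    # service the barrier socket first (one possible interleaving).
    data = deque([("data", "pkt0"), ("data", "pkt1")])
    barrier = deque([("barrier", None)])
    return packets_seen_before_barrier([barrier, data])


print(single_stream())     # packets consumed before the barrier was seen
print(separate_sockets())  # barrier seen while packets are still queued
```

The sketch only shows that such an interleaving is possible with two channels; it says nothing about how either implementation handles packets that are still queued when the barrier is observed.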


Thanks,
Curtis
Mohammad Alian
2015-06-27 09:20:11 UTC
Permalink
Hi All,

Curtis, thank you for listing some of the differences. I was waiting for
the complete multi-gem5 patch before sending my review. Please see my
inline responses below, which address the concerns you raised. I've also
added a few more points to the comparison.

> * Synchronization.
>
> pd-gem5 implements this in Python (not a problem in itself; aesthetically
> this is nice, but...). The issue is that pd-gem5's data packets and
> barrier messages travel over different sockets. Since pd-gem5 could see
> data packets passing synchronization barriers, it could create an
> inconsistent checkpoint.
>
> multi-gem5's synchronization is implemented in C++ using sync events, but
> more importantly, the messages queue up in the same stream and so cannot
> have the issue just described. (Event ordering is often crucial in
> snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
> solution in this respect.

Each packet in pd-gem5 carries a timestamp. So even if data packets pass
synchronization barriers (in other words, arrive early at the destination
node), the destination node processes them according to their timestamps.
Allowing data packets to pass sync barriers is actually a nice feature,
since it reduces the likelihood of late packet reception. The pd-gem5
implementation also preserves the ordering of data messages that flow
between pd-gem5 nodes.

What you mention as an advantage of multi-gem5 is actually a key
disadvantage: buffering sync messages behind data packets adds to the
synchronization overhead and can slow down simulation significantly. Also,
multi-gem5 sends large messages (multiHeaderPkt) over the network at each
synchronization point, which increases the synchronization overhead
further. In pd-gem5, we chose to send just one character as the sync
message, through a separate socket, to reduce synchronization overhead.
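
The timestamp-based delivery described above can be sketched as follows. This is a minimal illustration, assuming each packet carries an integer tick and that the receiver releases packets once simulated time reaches that tick; the class and method names are hypothetical and are not taken from the pd-gem5 patches.

```python
import heapq


class TimestampedInbox:
    """Buffers packets that may arrive early and releases them by timestamp."""

    def __init__(self):
        self._heap = []
        self._arrival = 0  # tie-breaker that preserves arrival order

    def receive(self, timestamp, payload):
        """Store a packet, regardless of when it physically arrived."""
        heapq.heappush(self._heap, (timestamp, self._arrival, payload))
        self._arrival += 1

    def deliver_until(self, now):
        """Pop every packet whose timestamp is <= the current simulated tick."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            timestamp, _, payload = heapq.heappop(self._heap)
            ready.append((timestamp, payload))
        return ready


inbox = TimestampedInbox()
inbox.receive(250, "b")  # arrived early, stamped for a later tick
inbox.receive(120, "a")
print(inbox.deliver_until(200))  # releases only the packet stamped 120
print(inbox.deliver_until(300))  # releases the packet stamped 250
```

With this kind of buffering, a packet that physically arrives before a barrier does not affect the simulation until its timestamp is reached.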

> * Packet handling.
>
> pd-gem5 uses EtherTap for data packets but changed the polling mechanism
> to go through the main event queue. Since this rate is actually linked
> with simulator progress, it cannot guarantee that the packets are serviced
> at regular intervals of real time. This can lead to packets queueing up
> which would contribute to the synchronization issues mentioned above.
>
> multi-gem5 uses plain sockets with separate receive threads and so does not
> have this issue.

I think you are again pointing to your first concern, which I've explained
above. Packets that have queued up in the EtherTap socket will be processed
and delivered to the simulation environment at the beginning of the next
simulation quantum.

Please note that multi-gem5 introduces a new SimObject to interface the
simulation environment with the real world, which is redundant: EtherTap
already provides this functionality.

> * Checkpoint accuracy.
>
> A user would like to have a checkpoint at precisely the time the
> 'm5 checkpoint' operation is executed so as to not miss any of the
> area of interest in his application.
>
> pd-gem5 requires that simulation finish the current quantum
> before checkpointing, so it cannot provide this.
>
> (Shortening the quantum can help, but usually the snapshot is being taken
> while 'fast-forwarding', i.e. simulating as fast as possible, which would
> motivate a longer quantum.)
>
> multi-gem5 can enter the drain cycle immediately upon receiving a
> checkpoint request. We find this accuracy highly desirable.

It's true that with a large quantum size there will be some discrepancy
between the tick of the m5_ckpt instruction and the tick at which the
checkpoint is actually dumped. Based on the multi-gem5 code, my
understanding is that you send an async checkpoint message as soon as one
of the gem5 processes encounters an m5_ckpt instruction. But I'm not sure
how this fixes the aforementioned issue, because you have to sync all gem5
processes before you start dumping the checkpoint, which necessitates a
global synchronization beforehand.

By the way, we have a fix for this issue that introduces a new m5 pseudo
instruction.
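
To make the size of that discrepancy concrete, here is a small sketch. It assumes that quantum boundaries fall on integer multiples of the quantum size and that a request landing exactly on a boundary is served at that boundary; both are assumptions of this example rather than documented behaviour, and the function names are hypothetical.

```python
def checkpoint_tick(request_tick, quantum):
    """Tick of the first quantum boundary at or after request_tick."""
    # Integer ceiling division: round request_tick up to a multiple of quantum.
    return -(-request_tick // quantum) * quantum


def discrepancy(request_tick, quantum):
    """Ticks simulated between the checkpoint request and the actual dump."""
    return checkpoint_tick(request_tick, quantum) - request_tick


for quantum in (1_000, 100_000):
    print(quantum, discrepancy(1_234_567, quantum))
```

Under these assumptions the discrepancy is always smaller than one quantum, which is why a longer quantum (as used when fast-forwarding) widens the gap.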

> * Implementation of network topology.
>
> pd-gem5 uses a separate gem5 process to act as a switch whereas multi-gem5
> uses a standalone packet relay process.
>
> We haven't measured the overhead of pd-gem5's simulated switch yet, but
> we're confident that our approach is at least as fast and more scalable.

pd-gem5 has the flexibility to simulate a switch box alongside one of the
other gem5 processes, although that might make that gem5 process the
simulation bottleneck. One of the advantages of pd-gem5 over multi-gem5 is
that we use gem5 to simulate a switch box, which allows us to model any
network topology by instantiating several Switch SimObjects and
interconnecting them with EtherLink in an arbitrary fashion. A standalone
TCP server can only provide switch functionality (forwarding packets to
their destinations) and model a star network topology. Furthermore, it
cannot model network timing effects such as queueing delay, congestion,
and routing latency. It also has some accuracy issues, which I will point
out next.

* Broken network timing:

Forwarding packets between gem5 processes through a standalone TCP server
can reorder packets that have different sources but the same destination.
This causes inaccurate network timing and, worst of all, non-deterministic
simulation. pd-gem5 resolves this by reordering packets at the Switch
process before sending them to their destination (this is possible because
the switch is synchronized with the rest of the nodes).
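
The reordering step can be sketched as a deterministic merge of per-source packet streams. This is an illustration only: it assumes each source's packets are already sorted by timestamp, and breaking timestamp ties by source id is a choice made for this example, not something stated for pd-gem5. The function name is hypothetical.

```python
import heapq


def merge_for_destination(per_source_packets):
    """Merge per-source packet streams into one deterministic order.

    per_source_packets maps a source id to a list of (timestamp, payload)
    tuples sorted by timestamp. Ties between sources are broken by source
    id, so the result does not depend on physical arrival order.
    """
    streams = [
        [(timestamp, source, payload) for timestamp, payload in packets]
        for source, packets in per_source_packets.items()
    ]
    # Order by (timestamp, source id); packets from one source keep their order.
    return list(heapq.merge(*streams, key=lambda pkt: (pkt[0], pkt[1])))


arrivals = {
    "node1": [(100, "x"), (300, "z")],
    "node0": [(100, "w"), (200, "y")],
}
for timestamp, source, payload in merge_for_destination(arrivals):
    print(timestamp, source, payload)
```

Because the order depends only on timestamps and source ids, two runs that receive the same packets in a different physical order still forward them identically.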

* Amount of changes

pd-gem5 introduces different modes in EtherLink only to provide accurate
timing for each component in the network subsystem (NIC, link, switch), as
well as the capability to model different network topologies (mesh, ring,
fat tree, etc.). To enable simple functionality like what multi-gem5
provides, the changes to gem5 can be limited to timestamping packets and
providing synchronization through Python scripts. multi-gem5, however,
re-implements functionality that is already in gem5.

* Integrating with gem5 mainstream:

The pd-gem5 launch script is written in Python, which suits integration
with the gem5 Python scripts, whereas multi-gem5 uses a bash script. Also,
all source files in pd-gem5 are already part of mainstream gem5, whereas
multi-gem5 has tcp_server.cc/hh, which is a standalone process and cannot
be part of gem5.

Thank you,
Mohammad

On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <***@arm.com>
wrote:

> Hello everyone,
> We have taken a look at how pd-gem5 compares with multi-gem5. While
> intending
> to deliver the same functionality, there are some crucial differences:
>
> * Synchronization.
>
> pd-gem5 implements this in Python (not a problem in itself;
> aesthetically
> this is nice, but...). The issue is that pd-gem5's data packets and
> barrier messages travel over different sockets. Since pd-gem5 could see
> data packets passing synchronization barriers, it could create an
> inconsistent checkpoint.
>
> multi-gem5's synchronization is implemented in C++ using sync events,
> but
> more importantly, the messages queue up in the same stream and so cannot
> have the issue just described. (Event ordering is often crucial in
> snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
> solution in this respect.
>
> * Packet handling.
>
> pd-gem5 uses EtherTap for data packets but changed the polling mechanism
> to go through the main event queue. Since this rate is actually linked
> with simulator progress, it cannot guarantee that the packets are
> serviced
> at regular intervals of real time. This can lead to packets queueing up
> which would contribute to the synchronization issues mentioned above.
>
> multi-gem5 uses plain sockets with separate receive threads and so does
> not
> have this issue.
>
> * Checkpoint accuracy.
>
> A user would like to have a checkpoint at precisely the time the
> 'm5 checkpoint' operation is executed so as to not miss any of the
> area of interest in his application.
>
> pd-gem5 requires that simulation finish the current quantum
> before checkpointing, so it cannot provide this.
>
> (Shortening the quantum can help, but usually the snapshot is being taken
> while 'fast-forwarding', i.e. simulating as fast as possible, which would
> motivate a longer quantum.)
>
> multi-gem5 can enter the drain cycle immediately upon receiving a
> checkpoint request. We find this accuracy highly desirable.
>
> * Implementation of network topology.
>
> pd-gem5 uses a separate gem5 process to act as a switch whereas
> multi-gem5
> uses a standalone packet relay process.
>
> We haven't measured the overhead of pd-gem5's simulated switch yet, but
> we're confident that our approach is at least as fast and more scalable.
>
>
> Thanks,
> Curtis
> ________________________________________
> From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad Alian [
> ***@wisc.edu]
> Sent: Friday, June 26, 2015 7:37 PM
> To: gem5 Developer List
> Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed system
> on multiple physical hosts
>
> Hi Anthony,
>
> I think that would be a good option; then I can add pd-gem5 functionality
> on top of that. Right now I've simplified your implementation. Also, I
> think I had found some bugs in your patch that I cannot remember now. If
> you decide to ship the EtherSwitch patch, let me know and I will review
> it.
>
> Thanks,
> Mohammad
>
> On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
> ***@amd.com> wrote:
>
> > Would it make sense for me to ship the EtherSwitch patch first, since it
> > has utility on its own, and then we can decide which of the "multi-gem5"
> > approaches is best, or if it's some combination of both?
> >
> > The only reason I never shipped it was because Steve raised an issue that
> > I didn't have a good alternative for, and didn't have the time to look
> > into one at that time.
> > ________________________________________
> > From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad Alian [
> > ***@wisc.edu]
> > Sent: Wednesday, June 24, 2015 12:43 PM
> > To: gem5 Developer List
> > Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed system
> > on multiple physical hosts
> >
> > Hi Andreas,
> >
> > Thanks for the comment.
> > I think the checkpointing support in both works is the same. Here is how
> > checkpointing support is implemented in pd-gem5:
> >
> > Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
> > instruction, it sends a “recv-ckpt” signal to the “barrier” process. The
> > “barrier” process then sends a “take-ckpt” signal to all the simulated
> > nodes (including the node that encountered m5-checkpoint) at the end of
> > the current simulation quantum. On reception of the “take-ckpt” signal,
> > the gem5 processes start dumping checkpoints. This makes each simulated
> > node dump a checkpoint at the same simulated time point while ensuring
> > there are no in-flight packets.
> >
> > I believe this is the same as the multi-gem5 patch's approach to checkpoint
> > support (based on the commit message of http://reviews.gem5.org/r/2865/).
> > Also, we have tested our mechanism with several benchmarks and it works. As
> > Steve suggested, I'll look into Curtis's patch and try to review it as
> > well. But as Nilay also mentioned earlier, there is some code missing in
> > Curtis's patch. I prefer to first run multi-gem5 before starting to review
> > it.
> >
> > Thank you,
> > Mohammad
> >
> > On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
> ***@arm.com>
> > wrote:
> >
> > > Hi Steve,
> > >
> > > Apologies for the confusion. We are on the same page. My point is that we
> > > cannot simply take a little bit of patch A and a little bit of patch B.
> > > This change involves a lot of code, and we need to approach this in a
> > > structured fashion. My proposal is to do it bottom up, and start by
> > > getting the basic support in place. Since http://reviews.gem5.org/r/2826/
> > > has already been on the review board for a few months, I am merely
> > > suggesting that it would be a good start to relate the newly posted
> > > patches to what is already there.
> > >
> > > Andreas
> > >
> > >
> > >
> > > On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> > > <gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
> > >
> > > >Hi Andreas,
> > > >
> > > >I'm a little confused by your email---you say you're fundamentally opposed
> > > >to looking at both patches and picking the best features, then you point
> > > >out that the patches Curtis posted have the feature of better checkpointing
> > > >support so we should pick that :).
> > > >
> > > >Obviously we can't just pick patch A from Mohammad's set and patch B from
> > > >Curtis's set and expect them to work together, but I think that having both
> > > >sets of patches available and comparing and contrasting the two
> > > >implementations should enable us to get to a single implementation that's
> > > >the best of both. Someone will have to make the effort of integrating the
> > > >better ideas from one set into the other set to create a new unified set of
> > > >patches (or maybe we commit one set and then integrate the best of the
> > > >other set as patches on top of that), but the first step is to identify
> > > >what "the best of both" is. Having Mohammad look at Curtis's patches, and
> > > >Curtis (or someone else from ARM) closely examine Mohammad's patches would
> > > >be a great start. I intend to review them both, though unfortunately my
> > > >time has been scarce lately---I'm hoping to squeeze that in later this
> > > >week.
> > > >
> > > >Once we've had a few people look at both, we can discuss the pros and cons
> > > >of each, then discuss the strategy for getting the best features in. So
> > > >far I've heard that Mohammad's patches have a better network model but the
> > > >ARM patches have better checkpointing support; that seems like a good
> > > >start.
> > > >
> > > >Steve
> > > >
> > > >On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> > ***@arm.com
> > > >
> > > >wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> Great work. However, I fundamentally do not believe in the approach of
> > > >> ‘letting reviewers pick the best features’. There is no way we would ever
> > > >> get something working out of it. We need to get _one_ working solution
> > > >> here, and figure out how to best get there. I would propose to do it
> > > >> bottom up, starting with the basic multi-simulator instance support,
> > > >> checkpointing support, and then move on to the network between the
> > > >> simulator instances.
> > > >>
> > > >> Thus, I propose we go with the low-level plumbing and checkpoint support
> > > >> from what Curtis has posted. I believe proper checkpointing support to be
> > > >> the most challenging, and from what I can tell this is far more limited in
> > > >> what you just posted, Mohammad. Could you perhaps review Curtis's patches
> > > >> based on your insights, and we can try and get these patches in shape and
> > > >> committed asap.
> > > >>
> > > >> Once we have the baseline functionality in place, then we can start
> > > >> looking at the more elaborate network models.
> > > >>
> > > >> Does this sound reasonable?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Andreas
> > > >>
> > > >> On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> > > >> <gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> > > >>
> > > >> >Hello All,
> > > >> >
> > > >> >I have submitted a chain of patches which enables gem5 to simulate a
> > > >> >cluster on multiple physical hosts:
> > > >> >
> > > >> >http://reviews.gem5.org/r/2909/
> > > >> >http://reviews.gem5.org/r/2910/
> > > >> >http://reviews.gem5.org/r/2912/
> > > >> >http://reviews.gem5.org/r/2913/
> > > >> >http://reviews.gem5.org/r/2914/ <http://reviews.gem5.org/r/2914/>
> > > >> >
> > > >> >and a patch that contains run scripts for a simple experiment:
> > > >> >http://reviews.gem5.org/r/2915/
> > > >> >
> > > >> >We have run several benchmarks using this infrastructure, including NAS
> > > >> >parallel benchmarks (MPI) and DCBench-hadoop
> > > >> >(http://prof.ict.ac.cn/DCBench/),
> > > >> >and would be happy to share scripts/diskimages.
> > > >> >
> > > >> >We call this *pd-gem5*. *pd-gem5* functionality is more or less the same
> > > >> >as Curtis's patch for *multi-gem5*. However, I feel *pd-gem5* network
> > > >> >model is more thorough; it also enables modeling different network
> > > >> >topologies. Having both set of changes together let reviewers to pick best
> > > >> >features from both works.
> > > >> >
> > > >> >Thank you,
> > > >> >Mohammad Alian
Gabor Dozsa
2015-07-01 10:20:52 UTC
Permalink
Hi All,

Thank you Mohammad for your elaboration on the issues!

I wrote most of the multi-gem5 patch, so let me add some clarifications and respond to your concerns. My comments are inline below.

Thanks,
- Gabor

On 27/06/2015 10:20, "gem5-dev on behalf of Mohammad Alian"
<gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:

> Hi All,
>
> Curtis, thank you for listing some of the differences. I was waiting for the
> completed multi-gem5 patch before sending my review. Please see my inline
> response below. I’ve addressed the concerns that you raised. Also, I’ve
> added a bit more to the comparison.

> > * Synchronization.
> >
> > pd-gem5 implements this in Python (not a problem in itself; aesthetically
> > this is nice, but...). The issue is that pd-gem5's data packets and
> > barrier messages travel over different sockets. Since pd-gem5 could see
> > data packets passing synchronization barriers, it could create an
> > inconsistent checkpoint.
> >
> > multi-gem5's synchronization is implemented in C++ using sync events, but
> > more importantly, the messages queue up in the same stream and so cannot
> > have the issue just described. (Event ordering is often crucial in
> > snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
> > solution in this respect.

> Each packet in pd-gem5 has a time-stamp. So even if data packets pass
> synchronization barriers (in other words, data packets arrive early at the
> destination node), the destination node processes packets based on their
> timestamp. Actually, allowing data packets to pass sync barriers is a nice
> feature that can reduce the likelihood of late packet reception. Ordering
> of data messages that flow over pd-gem5 nodes is also preserved in the
> pd-gem5 implementation.

This seems to be a misunderstanding. Maybe the wording was not precise before. The problem isn’t a data packet “passing” a sync barrier but the other way around: a sync barrier that can pass a data packet (e.g. while the data packet is waiting in the host operating system socket layer). If that happens, the packet will arrive later than it was supposed to and may miss its computed receive tick.

For instance, let’s assume that the quantum coincides with the simulated Ether link delay. (This is the optimal choice of quantum to minimize the number of sync barriers.) If a data packet is sent right at the beginning of a quantum, it must arrive at the destination gem5 process within the same quantum in order not to miss its receive tick at the very beginning of the next quantum. If the sync barrier can pass the data packet, the data packet may arrive only during the next quantum (or, in extreme conditions, even later than that), so by the time it arrives the receiver gem5 may already have passed the receive tick.

Time-stamping does not help with this issue. Also, if a data packet is still waiting in the host operating system socket layer when the simulation thread exits to Python to complete the next sync barrier, the packet will not go into the checkpoint that may follow that sync barrier.
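
To make the timing argument concrete, here is a minimal Python sketch of the scenario described above. It is only an illustration, not code from either patch: the function names and the tick values are hypothetical, and it assumes the quantum equals the link delay.

```python
def receive_tick(send_tick, link_delay):
    """Tick at which the receiver has to schedule the packet."""
    return send_tick + link_delay


def can_schedule(send_tick, link_delay, receiver_tick_on_arrival):
    """True if the packet reached the receiver before its receive tick passed."""
    return receiver_tick_on_arrival <= receive_tick(send_tick, link_delay)


# Hypothetical values: quantum chosen equal to the simulated link delay.
QUANTUM = LINK_DELAY = 1000

# A packet sent at the very start of quantum 0 must be received at tick 1000.
# Case 1: it arrives while the receiver is still inside quantum 0.
in_time = can_schedule(0, LINK_DELAY, receiver_tick_on_arrival=400)

# Case 2: the sync barrier overtook it, so the receiver already entered
# quantum 1 and simulated past tick 1000 before the packet showed up.
too_late = can_schedule(0, LINK_DELAY, receiver_tick_on_arrival=1300)
```

In the second case the time-stamp tells the receiver when the packet should have been delivered, but that tick is already in the past, so the time-stamp alone cannot repair the schedule.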


> What you mentioned as an advantage for multi-gem5 is actually a key
> disadvantage: buffering sync messages behind data packets can add to the
> synchronization overhead and slow down simulation significantly.

The purpose of sync messages is to make sure that the data packets arrive in time (in terms of simulated time) at the destination so they can be scheduled to be received at the proper computed tick. Sync messages also make sure that no data packets are in flight when a sync barrier completes before we take a checkpoint. They definitely add overhead to the simulation but they are necessary for the correctness of the simulation.

The receive thread in multi-gem5 reads out packets from the socket in parallel with the simulation thread, so packets normally will not be “queueing up” before a sync barrier message. There is definitely room for improvement in the current implementation for reducing the synchronization overhead, but that is likely true for pd-gem5, too. The important thing here is that the solution must provide correctness (robustness) first.
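
As a rough illustration of the receive-thread idea, the sketch below drains a socket in a background thread while the main thread keeps going. It is not the multi-gem5 code (which is C++); the newline-delimited message format and all names here are assumptions made only for this example.

```python
import queue
import socket
import threading


def receiver(sock, inbox):
    """Drain the socket in the background so messages do not pile up in the OS."""
    with sock.makefile("rb") as stream:
        for line in stream:              # loop ends when the peer closes
            inbox.put(line.decode().strip())


# A connected socket pair stands in for the link between two gem5 processes.
peer, local = socket.socketpair()
inbox = queue.Queue()
thread = threading.Thread(target=receiver, args=(local, inbox))
thread.start()

# Data packets and the sync message share one stream, so their order is kept.
for msg in ("DATA 1", "DATA 2", "SYNC"):
    peer.sendall((msg + "\n").encode())
peer.close()

thread.join()
local.close()

received = []
while not inbox.empty():
    received.append(inbox.get())
```

Because both message kinds travel over the same stream, the sync message can never be seen before the data packets that were sent ahead of it.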

> Also, multi-gem5 sends huge messages (multiHeaderPkt) through the network
> to perform each synchronization point, which increases synchronization
> overhead further. In pd-gem5, we chose to send just one character as the
> sync message through a separate socket to reduce synchronization overhead.

The TCP/IP message size is unlikely to be the bottleneck here. Multi-gem5 sends ~50 bytes more in a sync barrier message than pd-gem5, but that bigger sync message still fits into a single Ethernet frame on the wire. The end-to-end latency overhead caused by 50 bytes of extra payload in a small single-frame TCP/IP message is likely to fall into the ‘noise’ category if one tries to measure it in a real cluster.


> > * Packet handling.
> >
> > pd-gem5 uses EtherTap for data packets but changed the polling mechanism
> > to go through the main event queue. Since this rate is actually linked
> > with simulator progress, it cannot guarantee that the packets are serviced
> > at regular intervals of real time. This can lead to packets queueing up
> > which would contribute to the synchronization issues mentioned above.
> >
> > multi-gem5 uses plain sockets with separate receive threads and so does not
> > have this issue.

> I think you are again pointing to your first concern, which I’ve explained
> above. Packets that have queued up in the EtherTap socket will be processed
> and delivered to the simulation environment at the beginning of the next
> simulation quantum.

As I pointed out above, a packet queued up in the EtherTap socket may miss the quantum in which it should be received and/or the checkpoint in which it should be saved.


> Please notice that multi-gem5 introduces new simObjects to interface the
> simulation environment to the real world, which is redundant. This
> functionality is already provided by EtherTap.

Except that the EtherTap solution does not provide a correct (robust) solution for the synchronization problem.


> > * Checkpoint accuracy.
> >
> > A user would like to have a checkpoint at precisely the time the
> > 'm5 checkpoint' operation is executed so as to not miss any of the
> > area of interest in his application.
> >
> > pd-gem5 requires that simulation finish the current quantum
> > before checkpointing, so it cannot provide this.
> >
> > (Shortening the quantum can help, but usually the snapshot is being taken
> > while 'fast-forwarding', i.e. simulating as fast as possible, which would
> > motivate a longer quantum.)
> >
> > multi-gem5 can enter the drain cycle immediately upon receiving a
> > checkpoint request. We find this accuracy highly desirable.

> It’s true that if you have a large quantum size then there would be some
> discrepancy between the m5_ckpt instruction tick and the actual dump tick.
> Based on the multi-gem5 code, my understanding is that you send an async
> checkpoint message as soon as one of the gem5 processes encounters the
> m5_ckpt instruction. But I’m not sure how you fix the aforementioned issue,
> because you have to sync all gem5 processes before you start dumping the
> checkpoint, which necessitates a global synchronization beforehand.

In multi-gem5, the gem5 process that encounters the m5_ckpt instruction sends out an async checkpoint notification to the peer gem5 processes and then starts draining immediately (at the same tick). So the checkpoint will be taken at the exact tick from the initiator process's point of view. The global synchronisation with the peer processes takes place while the initiator process is still waiting at the same tick (i.e. the simulation thread is suspended). However, the receiver thread continues reading from the socket while waiting for the global sync to complete, to make sure that in-flight data packets from peer gem5 processes are stored properly and saved into the checkpoint.
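
The difference in checkpoint accuracy can be put into numbers with a small Python sketch. This is a simplified model under stated assumptions, not code from either patch: the function names, the quantum length and the request tick are all hypothetical, and the request is assumed to fall strictly inside a quantum.

```python
def ckpt_tick_end_of_quantum(request_tick, quantum):
    """Checkpoint tick if the current quantum has to finish first."""
    return (request_tick // quantum + 1) * quantum


def ckpt_tick_immediate(request_tick):
    """Checkpoint tick if the initiator drains at the requesting tick."""
    return request_tick


QUANTUM = 1_000_000     # hypothetical quantum length in ticks
request = 3_250_000     # hypothetical tick of the m5 checkpoint instruction

late = ckpt_tick_end_of_quantum(request, QUANTUM)
exact = ckpt_tick_immediate(request)
skew = late - exact     # simulated time that still runs before the dump
```

With these example values the end-of-quantum scheme dumps 750,000 ticks after the request, and the gap grows with the quantum, which is why a long fast-forwarding quantum makes the discrepancy worse.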


> By the way, we have a fix for this issue by introducing a new m5 pseudo
> instruction.

I fail to see how a new pseudo instruction can solve the problem of completing the full quantum in pd-gem5 before a checkpoint can be taken. Could you please elaborate on that?


> > * Implementation of network topology.
> >
> > pd-gem5 uses a separate gem5 process to act as a switch whereas multi-gem5
> > uses a standalone packet relay process.
> >
> > We haven't measured the overhead of pd-gem5's simulated switch yet, but
> > we're confident that our approach is at least as fast and more scalable.

> There is this flexibility in pd-gem5 to simulate a switch box alongside one
> of the other gem5 processes. However, it might make that gem5 process the
> simulation bottleneck. One of the advantages of pd-gem5 over multi-gem5 is
> that we use gem5 to simulate a switch box, which allows us to model any
> network topology by instantiating several Switch simObjects and
> interconnecting them with EtherLink in an arbitrary fashion. A standalone TCP
> server can only provide switch functionality (forwarding packets to
> destinations) and model a star network topology. Furthermore, it cannot
> model various network timings such as queueing delay, congestion, and
> routing latency. It also has some accuracy issues that I will point out
> next.

I agree with the complex topology argument; I already mentioned that before as an advantage for pd-gem5 from the point of view of future extensions. However, I do not agree that multi-gem5 cannot model queueing delays and congestion. For a simple crossbar switch, it can model both, but the receive queues are distributed among the gem5 processes.


> * Broken network timing:
>
> Forwarding packets between gem5 processes using a standalone TCP server can
> cause reordering between packets that have different sources but the same
> destination. This causes inaccurate network timing and, worst of all,
> non-deterministic simulation. pd-gem5 resolves this by reordering packets at
> the Switch process and then sending them to their destination (this is
> possible because the switch is synchronized with the rest of the nodes).

In multi-gem5, there is always a HeaderPkt that contains some meta information for each data packet. The meta information includes the send tick and the sender rank (i.e. a unique ID of the sender gem5 process). We use that information to define a well-defined ordering of packets even if packets arrive at the same receiver from different senders. This packet ordering scheme is still being tested, so the corresponding patch is not on the RB yet.
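
One way such an ordering could look is sketched below in Python. The exact rule is not spelled out in this thread, so the sketch assumes a plain lexicographic order on (send tick, sender rank); the field names and the `delivery_order` helper are hypothetical and only borrow the HeaderPkt name for readability.

```python
from typing import NamedTuple


class HeaderPkt(NamedTuple):
    """Hypothetical stand-in for the per-packet meta information."""
    send_tick: int
    sender_rank: int
    payload: bytes


def delivery_order(packets):
    """Order packets by send tick, breaking ties by sender rank (assumed rule)."""
    return sorted(packets, key=lambda p: (p.send_tick, p.sender_rank))


# The same three packets reaching the receiver in two different host orders.
arrival_a = [
    HeaderPkt(500, 2, b"c"),
    HeaderPkt(500, 1, b"b"),
    HeaderPkt(100, 3, b"a"),
]
arrival_b = list(reversed(arrival_a))

order_a = [p.payload for p in delivery_order(arrival_a)]
order_b = [p.payload for p in delivery_order(arrival_b)]
```

Since the key depends only on simulated quantities, both arrival orders produce the same delivery order, which is the property needed for deterministic simulation.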


> * Amount of changes
>
> pd-gem5 introduces different modes in etherlink just to provide accurate
> timing for each component in the network subsystem (NIC, link, switch) as
> well as the capability of modeling different network topologies (mesh, ring,
> fat tree, etc). To enable simple functionality, like what multi-gem5
> provides, the amount of changes in gem5 can be limited to time-stamping
> packets and providing synchronization through Python scripts. However,
> multi-gem5 re-implements functionality that is already in gem5.

This argument holds only if both implementations are correct (robust). It still seems to me that pd-gem5 does not provide correctness for the synchronization/checkpointing parts.


> * Integrating with gem5 mainstream:
>
> The pd-gem5 launch script is written in Python, which is suited for
> integration with gem5's Python scripts. However, multi-gem5 uses a bash
> script. Also, all source files in pd-gem5 are already part of gem5
> mainstream, whereas multi-gem5 has tcp_server.cc/hh, which is a standalone
> process and cannot be part of gem5.

The multi-gem5 launch script is simple enough to rely only on the shell. It could obviously be rewritten in Python easily if that added any value. The tcp_server component is only a utility (like the ‘m5’ utility that is also part of gem5).

Cheers,
- Gabor



On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <***@arm.com<mailto:***@arm.com>>
wrote:

Hello everyone,
We have taken a look at how pd-gem5 compares with multi-gem5. While
intending
to deliver the same functionality, there are some crucial differences:

* Synchronization.

pd-gem5 implements this in Python (not a problem in itself;
aesthetically
this is nice, but...). The issue is that pd-gem5's data packets and
barrier messages travel over different sockets. Since pd-gem5 could
see
data packets passing synchronization barriers, it could create an
inconsistent checkpoint.

multi-gem5's synchronization is implemented in C++ using sync events,
but
more importantly, the messages queue up in the same stream and so
cannot
have the issue just described. (Event ordering is often crucial in
snapshot protocols.) Therefore we feel that multi-gem5 is a more
robust
solution in this respect.

* Packet handling.

pd-gem5 uses EtherTap for data packets but changed the polling
mechanism
to go through the main event queue. Since this rate is actually
linked
with simulator progress, it cannot guarantee that the packets are
serviced
at regular intervals of real time. This can lead to packets
queueing up
which would contribute to the synchronization issues mentioned above.

multi-gem5 uses plain sockets with separate receive threads and so
does
not
have this issue.

* Checkpoint accuracy.

A user would like to have a checkpoint at precisely the time the
'm5 checkpoint' operation is executed so as to not miss any of the
area of interest in his application.

pd-gem5 requires that simulation finish the current quantum
before checkpointing, so it cannot provide this.

(Shortening the quantum can help, but usually the snapshot is being
taken
while 'fast-forwarding', i.e. simulating as fast as possible, which
would
motivate a longer quantum.)

multi-gem5 can enter the drain cycle immediately upon receiving a
checkpoint request. We find this accuracy highly desirable.

* Implementation of network topology.

pd-gem5 uses a separate gem5 process to act as a switch whereas
multi-gem5
uses a standalone packet relay process.

We haven't measured the overhead of pd-gem5's simulated switch yet,
but
we're confident that our approach is at least as fast and more
scalable.


Thanks,
Curtis
________________________________________
From: gem5-dev [gem5-dev-***@gem5.org<mailto:gem5-dev-***@gem5.org>] On Behalf Of Mohammad Alian [
***@wisc.edu<mailto:***@wisc.edu>]
Sent: Friday, June 26, 2015 7:37 PM
To: gem5 Developer List
Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
system
on multiple physical hosts

Hi Anthony,

I think that would be a good option, then I can add pd-gem5
functionality
on top of that. Right now I've simplified your implementation. Also, I
think I had found some bugs in your patch that I cannot remember now. If
you decided to ship EtherSwitch patch, let me know to give you a review
on
that.

Thanks,
Mohammad

On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
***@amd.com<mailto:***@amd.com>> wrote:

> Would it make sense for me to ship the EtherSwitch patch first, since
it
> has utility on its own, and then we can decide which of the
"multi-gem5"
> approaches is best, or if it's some combination of both?
>
> The only reason I never shipped it was because Steve raised an issue
that
> I didn't have a good alternative for, and didn't have the time to look
into
> one at that time.
> ________________________________________
> From: gem5-dev [gem5-dev-***@gem5.org<mailto:gem5-dev-***@gem5.org>] on behalf of Mohammad
Alian [
> ***@wisc.edu<mailto:***@wisc.edu>]
> Sent: Wednesday, June 24, 2015 12:43 PM
> To: gem5 Developer List
> Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
system
> on multiple physical hosts
>
> Hi Andreas,
>
> Thanks for the comment.
> I think the checkpointing support in both works is the same. Here is
how
> checkpointing support is implemented in pd-gem5:
>
> Whenever one of gem5 processes encounter an m5-checkpoint pseudo
> instruction, it will send a “recv-ckpt” signal to the
> “barrier” process. Then the “barrier” process sends a “take-ckpt”
signal
to
> all the simulated nodes
> (including the node that encountered m5-checkpoint) at the end of the
> current simulation quantum. On the reception of
> “take-ckpt” signal, gem5 processes start dumping check-points. This
makes
> each simulated node dump a checkpoint
> at the same simulated time point while ensuring there is no in-flight
> packets.
>
> I believe this is the same as multi-gem5 patch approach for checkpoint
> support (based on the commit message of
http://reviews.gem5.org/r/2865/
).
> Also, we have tested our mechanism with several benchmarks and it
works.
As
> Steve suggested, I'll look into Curtis's patch and try to review it as
> well.
> But as Nilay also mentioned earlier, there are some codes missing in
> Curtis's patch. I prefer to first run multi-gem5 before starting to
review
> it.
>
> Thank you,
> Mohammad
>
> On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
***@arm.com<mailto:***@arm.com>>
> wrote:
>
> > Hi Steve,
> >
> > Apologies for the confusion. We are on the same page. My point is
that
we
> > cannot simply take a little bit of patch A and a little bit of
patch B.
> > This change involves a lot of code, and we need to approach this in
a
> > structured fashion. My proposal is to do it bottom up, and start by
> > getting the basic support in place. Since
> http://reviews.gem5.org/r/2826/
> > has already been on the review board for a few months, I am merely
> > suggesting that the it would be a good start to relate the newly
posted
> > patches to what is already there.
> >
> > Andreas
> >
> >
> >
> > On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> > <gem5-dev-***@gem5.org<mailto:gem5-dev-***@gem5.org> on behalf of ***@gmail.com<mailto:***@gmail.com>> wrote:
> >
> > >Hi Andreas,
> > >
> > >I'm a little confused by your email---you say you're fundamentally
> opposed
> > >to looking at both patches and picking the best features, then you
point
> > >out that the patches Curtis posted have the feature of better
> > >checkpointing
> > >support so we should pick that :).
> > >
> > >Obviously we can't just pick patch A from Mohammad's set and patch
B
> from
> > >Curtis's set and expect them to work together, but I think that
having
> > >both
> > >sets of patches available and comparing and contrasting the two
> > >implementations should enable us to get to a single implementation
> that's
> > >the best of both. Someone will have to make the effort of
integrating
> the
> > >better ideas from one set into the other set to create a new
unified
set
> > >of
> > >patches; (or maybe we commit one set and then integrate the best of
the
> > >other set as patches on top of that), but the first step is to
identify
> > >what "the best of both" is. Having Mohammad look at Curtis's
patches,
> and
> > >Curtis (or someone else from ARM) closely examine Mohammad's
patches
> would
> > >be a great start. I intend to review them both, though
unfortunately
my
> > >time has been scarce lately---I'm hoping to squeeze that in later
this
> > >week.
> > >
> > >Once we've had a few people look at both, we can discuss the pros
and
> cons
> > >of each, then discuss the strategy for getting the best features
in.
So
> > >far I've heard that Mohammad's patches have a better network model
but
> the
> > >ARM patches have better checkpointing support; that seems like a
good
> > >start.
> > >
> > >Steve
> > >
> > >On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> ***@arm.com<mailto:***@arm.com>
> > >
> > >wrote:
> > >
> > >> Hi all,
> > >>
> > >> Great work. However, I fundamentally do not believe in the
approach
of
> > >> ‘letting reviewers pick the best features’. There is no way we
would
> > >>ever
> > >> get something working out if it. We need to get _one_ working
solution
> > >> here, and figure out how to best get there. I would propose to
do it
> > >> bottom up, starting with the basic multi-simulator instance
support,
> > >> checkpointing support, and then move on to the network between
the
> > >> simulator instances.
> > >>
> > >> Thus, I propose we go with the low-level plumbing and checkpoint
> support
> > >> from what Curtis has posted. I believe proper checkpointing
support
to
> > >>be
> > >> the most challenging, and from what I can tell this is far more
> limited
> > >>in
> > >> what you just posted Mohammad. Could you perhaps review Curtis
patches
> > >> based on your insights, and we can try and get these patches in
shape
> > >>and
> > >> committed asap.
> > >>
> > >> Once we have the baseline functionality in place, then we can
start
> > >> looking at the more elaborate network models.
> > >>
> > >> Does this sound reasonable?
> > >>
> > >> Thanks,
> > >>
> > >> Andreas
> > >>
> > >> On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> > >> <gem5-dev-***@gem5.org<mailto:gem5-dev-***@gem5.org> on behalf of ***@wisc.edu<mailto:***@wisc.edu>> wrote:
> > >>
> > >> >Hello All,
> > >> >
> > >> >I have submitted a chain of patches which enables gem5 to simulate a
> > >> >cluster on multiple physical hosts:
> > >> >
> > >> >http://reviews.gem5.org/r/2909/
> > >> >http://reviews.gem5.org/r/2910/
> > >> >http://reviews.gem5.org/r/2912/
> > >> >http://reviews.gem5.org/r/2913/
> > >> >http://reviews.gem5.org/r/2914/
> > >> >
> > >> >and a patch that contains run scripts for a simple experiment:
> > >> >http://reviews.gem5.org/r/2915/
> > >> >
> > >> >We have run several benchmarks using this infrastructure, including
> > >> >NAS parallel benchmarks (MPI) and DCBench-hadoop
> > >> >(http://prof.ict.ac.cn/DCBench/),
> > >> >and would be happy to share scripts/diskimages.
> > >> >
> > >> >We call this *pd-gem5*. *pd-gem5 *functionality is more or less the
> > >> >same as Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
> > >> >*network model is more thorough; it also enables modeling different
> > >> >network topologies. Having both set of changes together let reviewers
> > >> >to pick best features from both works.
> > >> >
> > >> >Thank you,
> > >> >Mohammad Alian
> > >> >_______________________________________________
> > >> >gem5-dev mailing list
> > >> >gem5-***@gem5.org
> > >> >http://m5sim.org/mailman/listinfo/gem5-dev
> > >>
> > >>
> > >> -- IMPORTANT NOTICE: The contents of this email and any attachments
> > >> are confidential and may also be privileged. If you are not the
> > >> intended recipient, please notify the sender immediately and do not
> > >> disclose the contents to any other person, use it for any purpose, or
> > >> store or copy the information in any medium. Thank you.
> > >>
> > >> ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
> > >> Registered in England & Wales, Company No: 2557590
> > >> ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1
> > >> 9NJ, Registered in England & Wales, Company No: 2548782
> > >>

Gabor Dozsa
2015-07-01 11:44:16 UTC
Permalink
Hi All,

Sorry for the missing indentation in my previous e-mail! (This was my
first e-mail to the dev-list so I could not simply use "reply".) Below is
the same message, hopefully in more readable form.

====================================

Hi All,

Thank you Mohammad for your elaboration on the issues!

I have written most of the multi-gem5 patch so let me add some more
clarifications and respond to your concerns. My comments are inline below.

Thanks,
- Gabor

On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:

>Hi All,
>
>Curtis, thank you for listing some of the differences. I was waiting for
>the completed multi-gem5 patch before sending my review. Please see my
>inline response below. I've addressed the concerns that you've raised.
>Also, I've added a bit more to the comparison.
>
>* Synchronization.
>
>pd-gem5 implements this in Python (not a problem in itself; aesthetically
>this is nice, but...). The issue is that pd-gem5's data packets and
>barrier messages travel over different sockets. Since pd-gem5 could see
>data packets passing synchronization barriers, it could create an
>inconsistent checkpoint.
>
>multi-gem5's synchronization is implemented in C++ using sync events, but
>more importantly, the messages queue up in the same stream and so cannot
>have the issue just described. (Event ordering is often crucial in
>snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
>solution in this respect.
>
>Each packet in pd-gem5 has a time-stamp. So even if data packets pass
>synchronization barriers (in other words, data packets arrive early at
>the destination node), the destination node processes packets based on
>their timestamp. Actually, allowing data packets to pass sync barriers is
>a nice feature that can reduce the likelihood of late packet reception.
>The ordering of data messages that flow over pd-gem5 nodes is also
>preserved in the pd-gem5 implementation.

This seems to be a misunderstanding. Maybe the wording was not precise
before. The problem is not a data packet "passing" a sync barrier
but the other way around: a sync barrier that can pass a data packet
(e.g. while the data packet is waiting in the host operating system
socket layer). If that happens, the packet will arrive later than it was
supposed to and it may miss the computed receive tick.

For instance, let’s assume that the quantum coincides with the simulated
Ether link delay. (This is the optimal choice of quantum to minimize the
number of sync barriers.) If a data packet is sent right at the beginning
of a quantum then this packet must arrive at the destination gem5 process
within the same quantum in order not to miss its receive tick at the very
beginning of the next quantum. If the sync barrier can pass the data packet
then the data packet may arrive only during the next quantum (or in
extreme conditions even later than that), so when it arrives the receiver
gem5 may already have passed the receive tick.

Time-stamping does not help with this issue. Also, if a data packet is
waiting in the host operating system socket layer when the simulation
thread exits to python to complete the next sync barrier then the packet
will not go into the checkpoint that may follow that sync barrier.
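
As a rough illustration of the late-arrival case described above, here is a
minimal Python sketch. The constants and function names are invented for
this example and do not come from either patch set; the tick values are
arbitrary.

```python
# Hypothetical illustration only; names and values are assumptions.
QUANTUM = 1000      # ticks per sync quantum (assumed value)
LINK_DELAY = 1000   # simulated Ethernet link delay, chosen equal to the quantum


def receive_tick(send_tick: int) -> int:
    """Tick at which the receiver is supposed to process the packet."""
    return send_tick + LINK_DELAY


def is_late(send_tick: int, receiver_now: int) -> bool:
    """True if the receiver has already simulated past the receive tick."""
    return receiver_now > receive_tick(send_tick)


# A packet sent at tick 0 (start of quantum 0) must be handled at tick 1000.
# Delivered while the receiver is still inside quantum 0: on time.
on_time_case = is_late(send_tick=0, receiver_now=400)
# Delivered only after the barrier overtook it, with the receiver already
# inside quantum 1: the receive tick has been missed.
late_case = is_late(send_tick=0, receiver_now=1400)
print(on_time_case, late_case)
```

The time stamp lets the receiver detect that the packet is late, but it
cannot undo the fact that the receiver has already simulated past tick 1000.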

>
>What you mentioned as an advantage for multi-gem5 is actually a key
>disadvantage: buffering sync messages behind data packets can add to
>the synchronization overhead and slow down simulation significantly.

The purpose of sync messages is to make sure that the data packets arrive
in time (in terms of simulated time) at the destination so they can be
scheduled for being received at the proper computed tick. Sync messages
also make sure that no data packets are in flight when a sync barrier
completes before we take a checkpoint. They definitely add overhead for
the simulation but they are necessary for the correctness of the
simulation.

The receive thread in multi-gem5 reads out packets from the socket in
parallel with the simulation thread so packets normally will not be
"queueing up" before a sync barrier message. There is definitely room
for improvements in the current implementation for reducing the
synchronization overhead but that is likely true for pd-gem5, too.
The important thing here is that the solution must provide correctness
(robustness) first.
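
The single-stream argument can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: a `queue.Queue` stands in for
one ordered TCP connection, and the `SYNC` marker and `receiver` function are
hypothetical names, not multi-gem5 code.

```python
import queue
import threading

SYNC = "SYNC"  # hypothetical marker standing in for a barrier message


def receiver(stream: queue.Queue, inbox: list, barrier_seen: threading.Event) -> None:
    """Drain one ordered stream in a separate thread.

    Because data and barrier messages share the stream, every data packet
    sent before SYNC is stored before the barrier is reported.
    """
    while True:
        msg = stream.get()
        if msg == SYNC:
            barrier_seen.set()
            return
        inbox.append(msg)


stream = queue.Queue()  # FIFO, like a single TCP connection
inbox = []
barrier_seen = threading.Event()
worker = threading.Thread(target=receiver, args=(stream, inbox, barrier_seen))
worker.start()

# The sender writes its data packets first, then the barrier message.
for msg in ("pkt0", "pkt1", SYNC):
    stream.put(msg)

worker.join()
print(inbox, barrier_seen.is_set())
```

With two independent sockets there is no such guarantee, which is the
situation the discussion above is concerned with.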

>Also, multi-gem5 sends very large messages (multiHeaderPkt) through the
>network at each synchronization point, which increases synchronization
>overhead further. In pd-gem5, we chose to send just one character as the
>sync message through a separate socket to reduce synchronization overhead.

The TCP/IP message size is unlikely to be the bottleneck here. Multi-gem5
will send ~50 bytes more in a sync barrier message than pd-gem5 but that
bigger sync message still fits into a single Ethernet frame on the wire.
The end-to-end latency overhead that is caused by 50 bytes of extra payload
for a small single-frame TCP/IP message is likely to fall into the "noise"
category if one tries to measure it in a real cluster.
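
The frame-size claim can be checked with simple arithmetic. The sketch below
assumes a standard 1500-byte Ethernet MTU and 20-byte IPv4 and TCP headers
without options; the two sync message sizes are the approximate figures
quoted in this thread, and all names are made up for the example.

```python
# Assumed figures: standard Ethernet MTU, IPv4 and TCP headers without options.
ETHERNET_MTU = 1500
IPV4_HEADER = 20
TCP_HEADER = 20
MAX_SEGMENT_PAYLOAD = ETHERNET_MTU - IPV4_HEADER - TCP_HEADER

# Approximate sizes taken from the discussion above.
PD_GEM5_SYNC_BYTES = 1                           # "just one character"
MULTI_GEM5_SYNC_BYTES = PD_GEM5_SYNC_BYTES + 50  # "~50 bytes more"


def fits_in_one_frame(payload_bytes: int) -> bool:
    """True if the payload fits in a single unfragmented TCP segment."""
    return payload_bytes <= MAX_SEGMENT_PAYLOAD


print(MAX_SEGMENT_PAYLOAD)
print(fits_in_one_frame(PD_GEM5_SYNC_BYTES), fits_in_one_frame(MULTI_GEM5_SYNC_BYTES))
```

Under these assumptions both messages use well under 5% of the available
payload, so each costs exactly one frame per barrier.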

>
>* Packet handling.
>
>pd-gem5 uses EtherTap for data packets but changed the polling mechanism
>to go through the main event queue. Since this rate is actually linked
>with simulator progress, it cannot guarantee that the packets are serviced
>at regular intervals of real time. This can lead to packets queueing up
>which would contribute to the synchronization issues mentioned above.
>
>multi-gem5 uses plain sockets with separate receive threads and so does
>not have this issue.
>
>I think again you are pointing to your first concern that I've explained
>above. Packets that have queued up in the EtherTap socket will be
>processed and delivered to the simulation environment at the beginning of
>the next simulation quantum.
>
>Please notice that multi-gem5 introduces a new simObject to interface the
>simulation environment to the real world, which is redundant. This
>functionality is already provided by EtherTap.

Except that the EtherTap solution does not provide a correct (robust)
solution for the synchronization problem.

>
>* Checkpoint accuracy.
>
>A user would like to have a checkpoint at precisely the time the
>'m5 checkpoint' operation is executed so as to not miss any of the
>area of interest in his application.
>
>pd-gem5 requires that simulation finish the current quantum
>before checkpointing, so it cannot provide this.
>
>(Shortening the quantum can help, but usually the snapshot is being taken
>while 'fast-forwarding', i.e. simulating as fast as possible, which would
>motivate a longer quantum.)
>
>multi-gem5 can enter the drain cycle immediately upon receiving a
>checkpoint request. We find this accuracy highly desirable.
>
>It's true that if you have a large quantum size then there would be some
>discrepancy between the m5_ckpt instruction tick and the actual dump tick.
>Based on the multi-gem5 code, my understanding is that you send an async
>checkpoint message as soon as one of the gem5 processes encounters an
>m5_ckpt instruction. But I'm not sure how you fix the aforementioned
>issue, because you have to sync all gem5 processes before you start
>dumping the checkpoint, which necessitates a global synchronization
>beforehand.

In multi-gem5, the gem5 process that encounters the m5_ckpt instruction
sends out an async checkpoint notification to the peer gem5 processes and
then starts draining immediately (at the same tick). So the checkpoint
will be taken at the exact tick from the initiator process's point of
view. The global synchronisation with the peer processes takes place while
the initiator process is still waiting at the same tick (i.e. the
simulation thread is suspended). However, the receiver thread continues
reading out the socket while waiting for the global sync to complete, to
make sure that in-flight data packets from peer gem5 processes are stored
properly and saved into the checkpoint.
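
To make the accuracy difference concrete, the following Python sketch
compares the checkpoint tick seen by the initiator under the two policies
discussed in this thread: finishing the current quantum first versus
draining at the request tick. The function names and tick values are
hypothetical and only model the timing, not the actual protocol messages.

```python
# Hypothetical helper names; the tick values are arbitrary illustrations.
def quantum_end_checkpoint_tick(request_tick: int, quantum: int) -> int:
    """Checkpoint tick if the current quantum must finish first (round up)."""
    return -(-request_tick // quantum) * quantum


def immediate_checkpoint_tick(request_tick: int) -> int:
    """Checkpoint tick if the initiator starts draining at the request tick."""
    return request_tick


request_tick = 2_500   # tick at which the m5_ckpt instruction executes
quantum = 1_000        # sync quantum in ticks

skew = (quantum_end_checkpoint_tick(request_tick, quantum)
        - immediate_checkpoint_tick(request_tick))
print(skew)
```

The skew grows with the quantum, which is why a long fast-forwarding quantum
and an exact checkpoint tick pull in opposite directions under the first
policy.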

>
>By the way, we have a fix for this issue by introducing a new m5 pseudo
>instruction.

I fail to see how a new pseudo instruction can solve the problem of
completing the full quantum in pd-gem5 before a checkpoint can be taken.
Could you please elaborate on that?

>
>* Implementation of network topology.
>
>pd-gem5 uses a separate gem5 process to act as a switch whereas multi-gem5
>uses a standalone packet relay process.
>
>We haven't measured the overhead of pd-gem5's simulated switch yet, but
>we're confident that our approach is at least as fast and more scalable.
>
>pd-gem5 has the flexibility to simulate a switch box alongside one of
>the other gem5 processes. However, that might make that gem5 process the
>simulation bottleneck. One of the advantages of pd-gem5 over multi-gem5 is
>that we use gem5 to simulate a switch box, which allows us to model any
>network topology by instantiating several Switch simObjects and
>interconnecting them with EtherLink in an arbitrary fashion. A standalone
>tcp server can only provide switch functionality (forwarding packets to
>destinations) and model a star network topology. Furthermore, it cannot
>model various network timings such as queueing delay, congestion, and
>routing latency. It also has some accuracy issues that I will point out
>next.

I agree with the complex topology argument. We already mentioned that
before as an advantage for pd-gem5 from the point of view of future
extensions. However, I do not agree that multi-gem5 cannot model queueing
delays and congestion. For a simple crossbar switch, it can model queueing
delays and congestion, but the receive queues are distributed among the
gem5 processes.

>
>* Broken network timing:
>
>Forwarding packets between gem5 processes using a standalone tcp server
>can cause reordering between packets that have different sources but the
>same destination. This causes inaccurate network timing and, worst of all,
>non-deterministic simulation. pd-gem5 resolves this by reordering packets
>at the Switch process and then sending them to their destination (this is
>possible because the switch is synchronized with the rest of the nodes).

In multi-gem5, there is always a HeaderPkt that contains some meta
information for each data packet. The meta information includes the send
tick and the sender rank (i.e. a unique ID of the sender gem5 process).
We use that information to establish a well-defined ordering of packets
even if packets are arriving at the same receiver from different senders.
This packet ordering scheme is still being tested so the corresponding
patch is not on the RB yet.
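
One way such an ordering could work is sketched below in Python. This is an
assumption-laden illustration, not the scheme from the patch: the class and
field names are invented, and sorting by send tick first and sender rank
second is just one plausible choice of key.

```python
from typing import List, NamedTuple


class HeaderPkt(NamedTuple):
    """Hypothetical stand-in for the per-packet meta information."""
    send_tick: int
    sender_rank: int
    payload: bytes


def delivery_order(pkts: List[HeaderPkt]) -> List[HeaderPkt]:
    """Order packets by (send tick, sender rank), independent of arrival order."""
    return sorted(pkts, key=lambda p: (p.send_tick, p.sender_rank))


# The same three packets arriving in two different host-level orders.
arrival_a = [HeaderPkt(100, 2, b"c"), HeaderPkt(100, 0, b"a"), HeaderPkt(50, 1, b"b")]
arrival_b = list(reversed(arrival_a))

print([p.payload for p in delivery_order(arrival_a)])
print(delivery_order(arrival_a) == delivery_order(arrival_b))
```

Because the key depends only on simulated quantities, the resulting order is
the same however the host network happens to interleave the senders.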

>
>* Amount of changes
>
>pd-gem5 introduces different modes in etherlink just to provide accurate
>timing for each component in the network subsystem (NIC, link, switch) as
>well as the capability of modeling different network topologies (mesh,
>ring, fat tree, etc). To enable simple functionality, like what multi-gem5
>provides, the amount of changes in gem5 can be limited to time-stamping
>packets and providing synchronization through python scripts. However,
>multi-gem5 re-implements functionality that is already in gem5.

This argument holds only if both implementations are correct (robust). It
still seems to me that pd-gem5 does not provide correctness for the
synchronization/checkpointing parts.

>
>* Integrating with gem5 mainstream:
>
>pd-gem5 launch script is written in python which is suited for integration
>with gem5 python scripts. However multi-gem5 uses bash script. Also, all
>source files in pd-gem5 are already parts of gem5 mainstream. However
>multi-gem5 has tcp_server.cc/hh that is a standalone process and cannot be
>part of gem5.

The multi-gem5 launch script is simple enough to rely only on the shell. It
can obviously be easily re-written in python if that added any value. The
tcp_server component is only a utility (like the "m5" utility that is also
part of gem5).


Cheers,
- Gabor


>On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <***@arm.com>
>wrote:
>
>>Hello everyone,
>>We have taken a look at how pd-gem5 compares with multi-gem5. While
>>intending to deliver the same functionality, there are some crucial
>>differences:
>>
>>* Synchronization.
>>
>> pd-gem5 implements this in Python (not a problem in itself;
>> aesthetically this is nice, but...). The issue is that pd-gem5's data
>> packets and barrier messages travel over different sockets. Since
>> pd-gem5 could see data packets passing synchronization barriers, it
>> could create an inconsistent checkpoint.
>>
>> multi-gem5's synchronization is implemented in C++ using sync events,
>> but more importantly, the messages queue up in the same stream and so
>> cannot have the issue just described. (Event ordering is often crucial
>> in snapshot protocols.) Therefore we feel that multi-gem5 is a more
>> robust solution in this respect.
>>
>>* Packet handling.
>>
>> pd-gem5 uses EtherTap for data packets but changed the polling
>> mechanism to go through the main event queue. Since this rate is
>> actually linked with simulator progress, it cannot guarantee that the
>> packets are serviced at regular intervals of real time. This can lead
>> to packets queueing up which would contribute to the synchronization
>> issues mentioned above.
>>
>> multi-gem5 uses plain sockets with separate receive threads and so
>> does not have this issue.
>>
>>* Checkpoint accuracy.
>>
>> A user would like to have a checkpoint at precisely the time the
>> 'm5 checkpoint' operation is executed so as to not miss any of the
>> area of interest in his application.
>>
>> pd-gem5 requires that simulation finish the current quantum
>> before checkpointing, so it cannot provide this.
>>
>> (Shortening the quantum can help, but usually the snapshot is being
>> taken while 'fast-forwarding', i.e. simulating as fast as possible,
>> which would motivate a longer quantum.)
>>
>> multi-gem5 can enter the drain cycle immediately upon receiving a
>> checkpoint request. We find this accuracy highly desirable.
>>
>>* Implementation of network topology.
>>
>> pd-gem5 uses a separate gem5 process to act as a switch whereas
>> multi-gem5 uses a standalone packet relay process.
>>
>> We haven't measured the overhead of pd-gem5's simulated switch yet,
>> but we're confident that our approach is at least as fast and more
>> scalable.
>>
>>
>>Thanks,
>>Curtis
>>________________________________________
>>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad Alian [
>>***@wisc.edu]
>>Sent: Friday, June 26, 2015 7:37 PM
>>To: gem5 Developer List
>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>>system
>>on multiple physical hosts
>>
>>Hi Anthony,
>>
>>I think that would be a good option, then I can add pd-gem5
>>functionality on top of that. Right now I've simplified your
>>implementation. Also, I think I had found some bugs in your patch that I
>>cannot remember now. If you decide to ship the EtherSwitch patch, let me
>>know so I can give you a review on that.
>>
>>Thanks,
>>Mohammad
>>
>>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
>>***@amd.com> wrote:
>>
>>>Would it make sense for me to ship the EtherSwitch patch first, since
>>>it has utility on its own, and then we can decide which of the
>>>"multi-gem5" approaches is best, or if it's some combination of both?
>>>
>>>The only reason I never shipped it was because Steve raised an issue
>>>that I didn't have a good alternative for, and didn't have the time to
>>>look into one at that time.
>>>________________________________________
>>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
>>Alian [
>>>***@wisc.edu]
>>>Sent: Wednesday, June 24, 2015 12:43 PM
>>>To: gem5 Developer List
>>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>>system
>>>on multiple physical hosts
>>>
>>>Hi Andreas,
>>>
>>>Thanks for the comment.
>>>I think the checkpointing support in both works is the same. Here is
>>>how checkpointing support is implemented in pd-gem5:
>>>
>>>Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
>>>instruction, it will send a "recv-ckpt" signal to the "barrier"
>>>process. Then the "barrier" process sends a "take-ckpt" signal to all
>>>the simulated nodes (including the node that encountered m5-checkpoint)
>>>at the end of the current simulation quantum. On reception of the
>>>"take-ckpt" signal, gem5 processes start dumping checkpoints. This
>>>makes each simulated node dump a checkpoint at the same simulated time
>>>point while ensuring there are no in-flight packets.
>>>
>>>I believe this is the same as the multi-gem5 patch approach for
>>>checkpoint support (based on the commit message of
>>>http://reviews.gem5.org/r/2865/).
>>>Also, we have tested our mechanism with several benchmarks and it
>>>works. As Steve suggested, I'll look into Curtis's patch and try to
>>>review it as well.
>>>But as Nilay also mentioned earlier, some code is missing from
>>>Curtis's patch. I prefer to first run multi-gem5 before starting to
>>>review it.
>>>
>>>Thank you,
>>>Mohammad
>>>
>>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
>>***@arm.com>
>>>wrote:
>>>
>>>>Hi Steve,
>>>>
>>>>Apologies for the confusion. We are on the same page. My point is
>>>>that we cannot simply take a little bit of patch A and a little bit of
>>>>patch B. This change involves a lot of code, and we need to approach
>>>>this in a structured fashion. My proposal is to do it bottom up, and
>>>>start by getting the basic support in place. Since
>>>>http://reviews.gem5.org/r/2826/
>>>>has already been on the review board for a few months, I am merely
>>>>suggesting that it would be a good start to relate the newly posted
>>>>patches to what is already there.
>>>>
>>>>Andreas
>>>>
>>>>
>>>>
>>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
>>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
>>>>
>>>>>Hi Andreas,
>>>>>
>>>>>I'm a little confused by your email---you say you're fundamentally
>>>>>opposed to looking at both patches and picking the best features, then
>>>>>you point out that the patches Curtis posted have the feature of better
>>>>>checkpointing support so we should pick that :).
>>>>>
>>>>>Obviously we can't just pick patch A from Mohammad's set and patch B
>>>>>from Curtis's set and expect them to work together, but I think that
>>>>>having both sets of patches available and comparing and contrasting
>>>>>the two implementations should enable us to get to a single
>>>>>implementation that's the best of both. Someone will have to make the
>>>>>effort of integrating the better ideas from one set into the other set
>>>>>to create a new unified set of patches (or maybe we commit one set and
>>>>>then integrate the best of the other set as patches on top of that),
>>>>>but the first step is to identify what "the best of both" is. Having
>>>>>Mohammad look at Curtis's patches, and Curtis (or someone else from
>>>>>ARM) closely examine Mohammad's patches would be a great start. I
>>>>>intend to review them both, though unfortunately my time has been
>>>>>scarce lately---I'm hoping to squeeze that in later this week.
>>>>>
>>>>>Once we've had a few people look at both, we can discuss the pros and
>>>>>cons of each, then discuss the strategy for getting the best features
>>>>>in. So far I've heard that Mohammad's patches have a better network
>>>>>model but the ARM patches have better checkpointing support; that
>>>>>seems like a good start.
>>>>>
>>>>>Steve
>>>>>
>>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <***@arm.com>
>>>>>wrote:
>>>>>
>>>>>>Hi all,
>>>>>>
>>>>>>Great work. However, I fundamentally do not believe in the approach of
>>>>>>'letting reviewers pick the best features'. There is no way we would
>>>>>>ever get something working out of it. We need to get _one_ working
>>>>>>solution here, and figure out how to best get there. I would propose
>>>>>>to do it bottom up, starting with the basic multi-simulator instance
>>>>>>support, checkpointing support, and then move on to the network
>>>>>>between the simulator instances.
>>>>>>
>>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
>>>>>>support from what Curtis has posted. I believe proper checkpointing
>>>>>>support to be the most challenging, and from what I can tell this is
>>>>>>far more limited in what you just posted, Mohammad. Could you perhaps
>>>>>>review Curtis's patches based on your insights, and we can try to get
>>>>>>these patches in shape and committed asap?
>>>>>>
>>>>>>Once we have the baseline functionality in place, then we can start
>>>>>>looking at the more elaborate network models.
>>>>>>
>>>>>>Does this sound reasonable?
>>>>>>
>>>>>>Thanks,
>>>>>>
>>>>>>Andreas
>>>>>>
>>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
>>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
>>>>>>
>>>>>>>Hello All,
>>>>>>>
>>>>>>>I have submitted a chain of patches which enables gem5 to simulate a
>>>>>>>cluster on multiple physical hosts:
>>>>>>>
>>>>>>>http://reviews.gem5.org/r/2909/
>>>>>>>http://reviews.gem5.org/r/2910/
>>>>>>>http://reviews.gem5.org/r/2912/
>>>>>>>http://reviews.gem5.org/r/2913/
>>>>>>>http://reviews.gem5.org/r/2914/
>>>>>>>
>>>>>>>and a patch that contains run scripts for a simple experiment:
>>>>>>>http://reviews.gem5.org/r/2915/
>>>>>>>
>>>>>>>We have run several benchmarks using this infrastructure, including
>>>>>>>NAS parallel benchmarks (MPI) and DCBench-hadoop
>>>>>>>(http://prof.ict.ac.cn/DCBench/),
>>>>>>>and would be happy to share scripts/diskimages.
>>>>>>>
>>>>>>>We call this *pd-gem5*. *pd-gem5 *functionality is more or less the
>>>>>>>same as Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
>>>>>>>*network model is more thorough; it also enables modeling different
>>>>>>>network topologies. Having both set of changes together let reviewers
>>>>>>>to pick best features from both works.
>>>>>>>
>>>>>>>Thank you,
>>>>>>>Mohammad Alian












Mohammad Alian
2015-07-02 04:41:56 UTC
Permalink
Thanks Gabor for the reply.

I feel this conversation is useful as we can find out pros/cons of each
design.
Please find my response in-lined below.

Thank you,
Mohammad

On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com> wrote:

> Hi All,
>
> Sorry for the missing indentation in my previous e-mail! (This was my
> first e-mail to the dev-list so I could not simply use "reply".) Below is
> the same message, hopefully in more readable form.
>
> ====================================
>
> Hi All,
>
> Thank you Mohammad for your elaboration on the issues!
>
> I have written most of the multi-gem5 patch so let me add some more
> clarifications and respond to your concerns. My comments are inline below.
>
> Thanks,
> - Gabor
>
> On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>
> >Hi All,
> >
> >Curtis-Thank you for listing some of the differences. I was waiting for
> >the
> >completed multi-gem5 patch before I send my review. Please see my inline
> >response below. I've addressed the concerns that you've raised. Also, I've
> >added a bit more to the comparison.
> >
> >-* Synchronization.
> >
> >pd-gem5 implements this in Python (not a problem in itself; aesthetically
> >
> >this is nice, but...). The issue is that pd-gem5's data packets and
> >
> >barrier messages travel over different sockets. Since pd-gem5 could see
> >
> >data packets passing synchronization barriers, it could create an
> >
> >inconsistent checkpoint.
> >
> >multi-gem5's synchronization is implemented in C++ using sync events, but
> >
> >more importantly, the messages queue up in the same stream and so cannot
> >
> >have the issue just described. (Event ordering is often crucial in
> >
> >snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
> >
> >solution in this respect.
> >
> >Each packet in pd-gem5 has a time-stamp. So even if data packets pass
> >synchronization barriers (in another word data packets arrive early at the
> >destination node), destination node process packets based on their
> >timestamp. Actually allowing data packets to pass sync barriers is a nice
> >feature that can reduce the likelihood of late packet reception. Ordering
> >of data messages that flow over pd-gem5 nodes is also preserved in pd-gem5
> >implementation.
>
> This seems to be a misunderstanding. Maybe the wording was not precise
> before. The problem is not a data packet "passing" a sync barrier
> but the other way around, a sync barrier that can pass a data packet
> (e.g. while the data packet is waiting in the host operating system
> socket layer). If that happens, the packet will arrive later than it was
> supposed to and it may miss the computed receive tick.
>
> For instance, let’s assume that the quantum coincides with the simulated
> Ether link delay. (This is the optimal choice of quantum to minimize the
> number of sync barriers.) If a data packet is sent right at the beginning
> of a quantum then this packet must arrive at the destination gem5 process
> within the same quantum in order not to miss its receive tick at the very
> beginning of the next quantum. If the sync barrier can pass the data packet
> then the data packet may arrive only during the next quantum (or in
> extreme conditions even later than that) so when it arrives the receiver
> gem5 may pass already the receive tick.
>
This argument makes more sense than the previous one. Note that gem5 is a
cycle-accurate simulator and runs orders of magnitude slower than real
hardware, so it is almost impossible for the flight time of a packet through
the real network to exceed the simulation time of one quantum. We ran a
set of experiments just for this purpose: with the quantum size equal to the
EtherLink delay, we never saw a late-arrival violation (what you
described) across the full NAS benchmark suite (please refer to the paper).

multi-gem5 is optimized for a case that almost never happens, and it
sacrifices speedup for no gain.
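
A minimal sketch of the timestamp argument above, assuming each packet
carries its send tick and the receiver schedules delivery at send tick plus
the simulated link delay. The names StampedPacket, receive_tick and is_late
are hypothetical and are not taken from either patch; the check only flags
the "late arrival" case discussed in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StampedPacket:
    send_tick: int   # simulated tick at which the sender emitted the packet
    payload: bytes   # raw Ethernet frame


def receive_tick(pkt: StampedPacket, link_delay: int) -> int:
    """Simulated tick at which the packet must be delivered."""
    return pkt.send_tick + link_delay


def is_late(pkt: StampedPacket, link_delay: int, receiver_tick: int) -> bool:
    """True if the receiver has already simulated past the delivery tick."""
    return receiver_tick > receive_tick(pkt, link_delay)
```

With the quantum equal to the link delay, a packet sent at the start of a
quantum is late only if it reaches the receiving process after that process
has simulated past the start of the next quantum.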


> Time-stamping does help with this issue. Also, if a data packet is waiting
> in the host operating system socket layer when the simulation thread exits
> to python to complete the next sync barrier then the packet will not go
> into the checkpoint that may follow that sync barrier.
>
That's a good point. The current pd-gem5 checkpointing mechanism might miss
packets that were sent during the previous quantum and are still waiting in
the OS socket buffer. I should add some code to the EtherTap serialization
function to drain the EtherTap socket before writing the checkpoint. I will
update the pd-gem5 patch accordingly.
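
The drain step could look roughly like the sketch below. It is written in
Python for illustration only (EtherTap itself is C++), and drain_socket is a
hypothetical helper, not code from the pd-gem5 patch. It reads whatever is
already sitting in the local socket buffer without blocking.

```python
import socket


def drain_socket(sock: socket.socket, bufsize: int = 4096) -> bytes:
    """Return all bytes currently buffered on sock, without blocking."""
    chunks = []
    sock.setblocking(False)
    try:
        while True:
            try:
                data = sock.recv(bufsize)
            except BlockingIOError:
                break          # nothing left in the OS buffer
            if not data:
                break          # peer closed the connection
            chunks.append(data)
    finally:
        sock.setblocking(True)  # simplification: assumes a blocking socket
    return b"".join(chunks)
```

Note that this only captures data that has already reached the local
socket buffer at the moment of the call.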

>
> >What you mentioned as an advantage for multi-gem5 is actually a key
> >disadvantage: buffering sync messages behind data packets can add up to
> >the
> >synchronization overhead and slow down simulation significantly.
>
> The purpose of sync messages is to make sure that the data packets arrive
> in time (in terms of simulated time) at the destination so they can be
> scheduled for being received at the proper computed tick. Sync messages
> also make sure that no data packets are in flight when a sync barrier
> completes before we take a checkpoint. They definitely add overhead for
> the simulation but they are necessary for the correctness of the
> simulation.
>
> The receive thread in multi-gem5 reads out packets from the socket in
> parallel with the simulation thread so packets normally will not be
> "queueing up” before a sync barrier message. There is definitely room
> for improvements in the current implementation for reducing the
> synchronization overhead but that is likely true for pd-gem5, too.
> The important thing here is that the solution must provide correctness
> (robustness) first.
>
pd-gem5 provides correctness. Please read my previous comment. The whole
purpose of multi/pd-gem5 is to parallelize simulation with minimal overhead
and gain speedup. If you fail to do so, nobody will use your tool.


> >Also,
> >multi-gem5 send huge sized messages (multiHeaderPkt) through network to
> >perform each synchronization point, which increases synchronization
> >overhead further. In pd-gem5, we choose to send just one character as sync
> >message through a separate socket to reduce synchronization overhead.
>
> The TCP/IP message size is unlikely the bottleneck here. Multi-gem5 will
> send ~50 bytes more in a sync barrier message than pd-gem5 but that bigger
> sync message still fits into a single ethernet frame on the wire. The
> end-to-end latency overhead that is caused by 50 bytes extra payload for
> a small single frame TCP/IP message is likely to fall into the “noise"
> category if one tries to measure it in a real cluster.
>
You should prove your hypothesis experimentally. Each gem5 process
sends/receives sync messages at the end of every quantum. Say you are
simulating an "N"-node computer cluster with "M" different configurations.
Then you will have N*M gem5 processes that send/receive these 50 bytes (I
think it's more) of extra data over the network at the same time ...

Furthermore, multi-gem5 sends a header before each data message. By
comparison, pd-gem5 just adds 12 bytes (each time-stamp is the 12 least
significant digits of the Tick) to each data packet. I don't know exactly
how large these "MultiHeaderPkt" messages are, but it has two Tick fields
that are each 64 Bytes! Also, header packets are separate TCP packets, so
you pay for sending two separate packets for each data packet. And worst of
all, you serialize all of these with sync messages.
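
As an illustration of the 12-byte time-stamp described above, here is a
small sketch that assumes the stamp is stored as 12 ASCII decimal digits in
front of the frame. The exact wire format in the pd-gem5 patch may differ;
stamp and unstamp are hypothetical names.

```python
STAMP_DIGITS = 12  # number of least significant decimal digits kept


def stamp(tick: int, frame: bytes) -> bytes:
    """Prefix frame with the 12 least significant digits of tick."""
    digits = str(tick % 10**STAMP_DIGITS).zfill(STAMP_DIGITS)
    return digits.encode("ascii") + frame


def unstamp(data: bytes) -> tuple:
    """Split a stamped packet back into (truncated tick, frame)."""
    tick = int(data[:STAMP_DIGITS].decode("ascii"))
    return tick, data[STAMP_DIGITS:]
```

The per-packet overhead in this sketch is exactly STAMP_DIGITS bytes. The
receiver only sees the truncated tick, so it would need its own current tick
to reconstruct the full value.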


> >
> >* Packet handling.
> >
> >pd-gem5 uses EtherTap for data packets but changed the polling mechanism
> >
> >to go through the main event queue. Since this rate is actually linked
> >
> >with simulator progress, it cannot guarantee that the packets are
> >serviced
> >
> >at regular intervals of real time. This can lead to packets queueing up
> >
> >which would contribute to the synchronization issues mentioned above.
> >
> >multi-gem5 uses plain sockets with separate receive threads and so does
> >not
> >
> >have this issue.
> >
> >I think again you are pointing to your first concern that I've explained
> >above. Packets that have queued up in EtherTap socket, will be processed
> >and delivered to simulation environment at the beginning of next
> >simulation
> >quantum.
> >
> >Please notice that multi-gem5 introduces a new simObjects to interface
> >simulation environment to real world which is redundant. This
> >functionality
> >is already there by EtherTap.
>
> Except that the EtherTap solution does not provide a correct (robust)
> solution for the synchronization problem.
>
Please read my first/second comments.


> >
> >* Checkpoint accuracy.
> >
> >A user would like to have a checkpoint at precisely the time the
> >
> >'m5 checkpoint' operation is executed so as to not miss any of the
> >
> >area of interest in his application.
> >
> >pd-gem5 requires that simulation finish the current quantum
> >
> >before checkpointing, so it cannot provide this.
> >
> >(Shortening the quantum can help, but usually the snapshot is being taken
> >
> >while 'fast-forwarding', i.e. simulating as fast as possible, which would
> >
> >motivate a longer quantum.)
> >
> >multi-gem5 can enter the drain cycle immediately upon receiving a
> >
> >checkpoint request. We find this accuracy highly desirable.
> >
> >It's true that if you have a large quantum size then there would be some
> >discrepancy between the m5_ckpt instruction tick and the actual dump tick.
> >Based on multi-gem5 code, my understanding is that you send async
> >checkpoint message as soon as one of the gem5 processes encounter m5_ckpt
> >instruction. But I'm not sure how you fix the aforementioned issue,
> >because
> >you have to sync all gem5 processes before you start dumping checkpoint,
> >which necessitate a global synchronization beforehand.
>
> In multi-gem5, the gem5 process who encounters the m5_ckpt instruction
> sends out an async checkpoint notification for the peer gem5 processes and
> then it starts the draining immediately (at the same tick). So the
> checkpoint will be taken at the exact tick from the initiator process
> point of view. The global synchronisation with the peer processes takes
> place while the initiator process is still waiting at the same tick (i.e
> the simulation thread is suspended). However, the receiver thread
> continues reading out the socket - while waiting for the global sync to
> complete- to make sure that in-flight data packets from peer gem5 processes
> are stored properly and saved into the checkpoint.
>
>
So you mean multi-gem5 ends up with gem5 processes at different
ticks after a checkpoint? In pd-gem5 we make sure that all gem5 processes
start dumping the checkpoint at the same tick. Are you sure that it is
correct to have each gem5 process dump its checkpoint at a different tick?

I don't think this is a correct checkpointing design. However, if you feel it
is correct, I can change a couple of lines in "Simulation.py" and the barrier
scripts to implement the same functionality in pd-gem5. One thing that you
are obsessed about is making sure that there are no in-flight packets when
we start dumping a checkpoint, and you have all these complex mechanisms in
place to ensure that! I think you can be 99.99999% sure that there is no
in-flight packet by waiting for 1 second after all gem5 processes have
finished their quantum simulation and then dumping the checkpoint. Do you
really think that delivering a TCP packet would take more than 1 second in
today's systems!? Always go for simple solutions ...



> >
> >By the way, we have a fix for this issue by introducing a new m5 pseudo
> >instruction.
>
> I fail to see how a new pseudo instruction can solve the problem of
> completing the full quantum in pd-gem5 before a checkpoint can be taken.
> Could you please elaborate on that?
>
As we take checkpoints while fast-forwarding, and it is likely that we relax
synchronization for speedup purposes, a new pseudo instruction that can set
the quantum size (m5_qset) can be helpful. One can insert m5_qset in the
benchmark source code before entering the ROI that contains m5_ckpt to
decrease the quantum size beforehand and reduce the discrepancy between the
m5_ckpt tick and the actual checkpoint tick. This is not included in the
pd-gem5 patch right now.
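
To make the effect of a smaller quantum concrete, here is a small
arithmetic sketch. It assumes quantum boundaries fall on multiples of the
quantum size and that the checkpoint is dumped at the first boundary at or
after the m5_ckpt tick; both function names are hypothetical.

```python
def checkpoint_dump_tick(ckpt_tick: int, quantum: int) -> int:
    """First quantum boundary at or after ckpt_tick (ceiling division)."""
    return -(-ckpt_tick // quantum) * quantum


def discrepancy(ckpt_tick: int, quantum: int) -> int:
    """Ticks simulated between the m5_ckpt request and the actual dump."""
    return checkpoint_dump_tick(ckpt_tick, quantum) - ckpt_tick
```

Under these assumptions the discrepancy is always smaller than the quantum,
so shrinking the quantum before the ROI bounds it accordingly.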


> >
> >* Implementation of network topology.
> >
> >pd-gem5 uses a separate gem5 process to act as a switch whereas multi-gem5
> >
> >uses a standalone packet relay process.
> >
> >We haven't measured the overhead of pd-gem5's simulated switch yet, but
> >
> >we're confident that our approach is at least as fast and more scalable.
> >
> >There is this flexibility in pd-gem5 to simulate a switch box alongside
> >one
> >of the other gem5 processes. However, it might make that gem5 process the
> >simulation bottleneck. One of the advantages of pd-gem5 over multi-gem5 is
> >that we use gem5 to simulate a switch box, which allows us to model any
> >network topology by instantiating several Switch simObjects and
> >interconnect them with EtherLink in an arbitrary fashion. A standalone tcp
> >server just can provide switch functionality (forwarding packets to
> >destinations) and model a star network topology. Furthermore, it cannot
> >model various network timings such as queueing delay, congestion, and
> >routing latency. Also it has some accuracy issues that I will point out
> >next.
>
> I agree with the complex topology argument. We already mentioned that
> before as an advantage for pd-gem5 from the point of view of future
> extensions. However, I do not agree that multi-gem5 cannot model queueing
> delays and congestions. For a simple crossbar switch, it can model queueing
> delays and congestions, but the receive queues are distributed among the
> gem5 processes.
>
It's true that you can model the queueing delay of a simple crossbar by
distributing queues across gem5 processes (end points). But to be able to
do so you have to ensure the ordering of the packets that you enqueue in the
distributed queues. That is almost impossible without a synchronized switch
box. You need a reorder queue that reorders packets dynamically and
updates the timing parameters for each packet as well. I don't know how much
progress you have made on the ordering scheme in multi-gem5, but you may
have already realized how complex and error-prone it can be. This argument
is also related to my next argument, "Broken network timing".
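
A reorder queue of the kind mentioned above can be sketched as a priority
queue keyed on (send tick, sender rank), so that packets from different
senders are released in the same order no matter how they arrived. This is
only an illustration: ReorderQueue and safe_tick are hypothetical names, and
the sketch does not update any timing parameters.

```python
import heapq


class ReorderQueue:
    """Releases packets in (send_tick, sender_rank) order."""

    def __init__(self):
        self._heap = []

    def push(self, send_tick: int, sender_rank: int, payload: bytes) -> None:
        # Arrival order does not matter; the heap orders by the tuple.
        heapq.heappush(self._heap, (send_tick, sender_rank, payload))

    def pop_ready(self, safe_tick: int) -> list:
        """Pop every packet whose send tick is strictly below safe_tick."""
        ready = []
        while self._heap and self._heap[0][0] < safe_tick:
            ready.append(heapq.heappop(self._heap))
        return ready
```

Here safe_tick stands for a tick up to which all senders are known to have
finished sending, for example the end of the last completed quantum.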


> >
> >* Broken network timing:
> >
> >Forwarding packets between gem5 processes using a standalone tcp server
> >can
> >cause reordering between packets that have different source but same
> >destination. It causes inaccurate network timing and worse of all
> >non-deterministic simulation. pd-gem5 resolve this by reordering packets
> >at
> >Switch process and then send them to their destination (it's possible as
> >switch is synchronized with the rest of the nodes).
>
> In multi-gem5, there is always a HeaderPkt that contains some meta
> information for each data packet. The meta information include the send
> tick and the sender rank (i.e. a unique ID of the sender gem5 process).
> We use those information to define a well defined ordering of packets even
> if packets are arriving at the same receiver from different senders. This
> packet ordering scheme is still being tested so the corresponding patch is
> not on the RB yet.
>
Please read my previous comment. The most important part of multi/pd-gem5
extension is ensuring accurate and deterministic simulation.


> >
> >* Amount of changes
> >
> >pd-gem5 introduce different modes in etherlink just to provide accurate
> >timing for each component in the network subsystem (NIC, link, switch) as
> >well as capability of modeling different network topologies (mesh, ring,
> >fat tree, etc). To enable a simple functionality, like what multi-gem5
> >provides, the amount of changes in gem5 can be limited to time-stamping
> >packets and providing synchronization through python scripts. However,
> >multi-gem5 re-implements functionalities that are already in gem5.
>
> This argument holds only if both implementations are correct (robust). It
> still seems to me that pd-gem5 does not provide correctness for the
> synchronization/checkpointing parts.
>
Again, please read my first comment for correctness of pd-gem5.


> >
> >* Integrating with gem5 mainstream:
> >
> >pd-gem5 launch script is written in python which is suited for integration
> >with gem5 python scripts. However multi-gem5 uses bash script. Also, all
> >source files in pd-gem5 are already parts of gem5 mainstream. However
> >multi-gem5 has tcp_server.cc/hh that is a standalone process and cannot be
> >part of gem5.
>
> The multi-gem5 launch script is simple enough to rely only on the shell. It
> can obviously be easily re-written in python if that added any value. The
> tcp_server component is only a utility (like the "m5" utility that is also
> part of gem5).
>
The thing is that users are likely to want to add some
functionality to the run script of multi/pd-gem5. E.g. the pd-gem5 run script
supports launching simulations using simulation pool management software (
http://research.cs.wisc.edu/htcondor/). Using Python lets users
easily add this kind of support.


>
> Cheers,
> - Gabor
>
>
> >On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <***@arm.com>
> >wrote:
> >
> >>Hello everyone,
> >>We have taken a look at how pd-gem5 compares with multi-gem5. While
> >>intending
> >>to deliver the same functionality, there are some crucial differences:
> >>
> >>* Synchronization.
> >>
> >> pd-gem5 implements this in Python (not a problem in itself;
> >>aesthetically
> >> this is nice, but...). The issue is that pd-gem5's data packets and
> >> barrier messages travel over different sockets. Since pd-gem5 could
> >>see
> >> data packets passing synchronization barriers, it could create an
> >> inconsistent checkpoint.
> >>
> >> multi-gem5's synchronization is implemented in C++ using sync events,
> >>but
> >> more importantly, the messages queue up in the same stream and so
> >>cannot
> >> have the issue just described. (Event ordering is often crucial in
> >> snapshot protocols.) Therefore we feel that multi-gem5 is a more
> >>robust
> >> solution in this respect.
> >>
> >>* Packet handling.
> >>
> >> pd-gem5 uses EtherTap for data packets but changed the polling
> >>mechanism
> >> to go through the main event queue. Since this rate is actually
> >>linked
> >> with simulator progress, it cannot guarantee that the packets are
> >>serviced
> >> at regular intervals of real time. This can lead to packets
> >>queueing up
> >> which would contribute to the synchronization issues mentioned above.
> >>
> >> multi-gem5 uses plain sockets with separate receive threads and so
> >>does
> >>not
> >> have this issue.
> >>
> >>* Checkpoint accuracy.
> >>
> >> A user would like to have a checkpoint at precisely the time the
> >> 'm5 checkpoint' operation is executed so as to not miss any of the
> >> area of interest in his application.
> >>
> >> pd-gem5 requires that simulation finish the current quantum
> >> before checkpointing, so it cannot provide this.
> >>
> >> (Shortening the quantum can help, but usually the snapshot is being
> >>taken
> >> while 'fast-forwarding', i.e. simulating as fast as possible, which
> >>would
> >> motivate a longer quantum.)
> >>
> >> multi-gem5 can enter the drain cycle immediately upon receiving a
> >> checkpoint request. We find this accuracy highly desirable.
> >>
> >>* Implementation of network topology.
> >>
> >> pd-gem5 uses a separate gem5 process to act as a switch whereas
> >>multi-gem5
> >> uses a standalone packet relay process.
> >>
> >> We haven't measured the overhead of pd-gem5's simulated switch yet,
> >>but
> >> we're confident that our approach is at least as fast and more
> >>scalable.
> >>
> >>
> >>Thanks,
> >>Curtis
> >>________________________________________
> >>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad Alian [
> >>***@wisc.edu]
> >>Sent: Friday, June 26, 2015 7:37 PM
> >>To: gem5 Developer List
> >>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> >>system
> >>on multiple physical hosts
> >>
> >>Hi Anthony,
> >>
> >>I think that would be a good option, then I can add pd-gem5
> >>functionality
> >>on top of that. Right now I've simplified your implementation. Also, I
> >>think I had found some bugs in your patch that I cannot remember now. If
> >>you decided to ship EtherSwitch patch, let me know to give you a review
> >>on
> >>that.
> >>
> >>Thanks,
> >>Mohammad
> >>
> >>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
> >>***@amd.com> wrote:
> >>
> >>>Would it make sense for me to ship the EtherSwitch patch first, since
> >>it
> >>>has utility on its own, and then we can decide which of the
> >>"multi-gem5"
> >>>approaches is best, or if it's some combination of both?
> >>>
> >>>The only reason I never shipped it was because Steve raised an issue
> >>that
> >>>I didn't have a good alternative for, and didn't have the time to look
> >>into
> >>>one at that time.
> >>>________________________________________
> >>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
> >>Alian [
> >>>***@wisc.edu]
> >>>Sent: Wednesday, June 24, 2015 12:43 PM
> >>>To: gem5 Developer List
> >>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> >>system
> >>>on multiple physical hosts
> >>>
> >>>Hi Andreas,
> >>>
> >>>Thanks for the comment.
> >>>I think the checkpointing support in both works is the same. Here is
> >>how
> >>>checkpointing support is implemented in pd-gem5:
> >>>
> >>>Whenever one of gem5 processes encounter an m5-checkpoint pseudo
> >>>instruction, it will send a "recv-ckpt" signal to the
> >>>"barrier" process. Then the "barrier" process sends a "take-ckpt"
> >>signal
> >>to
> >>>all the simulated nodes
> >>>(including the node that encountered m5-checkpoint) at the end of the
> >>>current simulation quantum. On the reception of
> >>>"take-ckpt" signal, gem5 processes start dumping check-points. This
> >>makes
> >>>each simulated node dump a checkpoint
> >>>at the same simulated time point while ensuring there is no in-flight
> >>>packets.
> >>>
> >>>I believe this is the same as multi-gem5 patch approach for checkpoint
> >>>support (based on the commit message of
> >>http://reviews.gem5.org/r/2865/
> >>).
> >>>Also, we have tested our mechanism with several benchmarks and it
> >>works.
> >>As
> >>>Steve suggested, I'll look into Curtis's patch and try to review it as
> >>>well.
> >>>But as Nilay also mentioned earlier, there are some codes missing in
> >>>Curtis's patch. I prefer to first run multi-gem5 before starting to
> >>review
> >>>it.
> >>>
> >>>Thank you,
> >>>Mohammad
> >>>
> >>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
> >>***@arm.com>
> >>>wrote:
> >>>
> >>>>Hi Steve,
> >>>>
> >>>>Apologies for the confusion. We are on the same page. My point is
> >>that
> >>we
> >>>>cannot simply take a little bit of patch A and a little bit of
> >>patch B.
> >>>>This change involves a lot of code, and we need to approach this in
> >>a
> >>>>structured fashion. My proposal is to do it bottom up, and start by
> >>>>getting the basic support in place. Since
> >>>http://reviews.gem5.org/r/2826/
> >>>>has already been on the review board for a few months, I am merely
> >>>>suggesting that the it would be a good start to relate the newly
> >>posted
> >>>>patches to what is already there.
> >>>>
> >>>>Andreas
> >>>>
> >>>>
> >>>>
> >>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> >>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
> >>>>
> >>>>>Hi Andreas,
> >>>>>
> >>>>>I'm a little confused by your email---you say you're fundamentally
> >>>opposed
> >>>>>to looking at both patches and picking the best features, then you
> >>point
> >>>>>out that the patches Curtis posted have the feature of better
> >>>>>checkpointing
> >>>>>support so we should pick that :).
> >>>>>
> >>>>>Obviously we can't just pick patch A from Mohammad's set and patch
> >>B
> >>>from
> >>>>>Curtis's set and expect them to work together, but I think that
> >>having
> >>>>>both
> >>>>>sets of patches available and comparing and contrasting the two
> >>>>>implementations should enable us to get to a single implementation
> >>>that's
> >>>>>the best of both. Someone will have to make the effort of
> >>integrating
> >>>the
> >>>>>better ideas from one set into the other set to create a new
> >>unified
> >>set
> >>>>>of
> >>>>>patches; (or maybe we commit one set and then integrate the best of
> >>the
> >>>>>other set as patches on top of that), but the first step is to
> >>identify
> >>>>>what "the best of both" is. Having Mohammad look at Curtis's
> >>patches,
> >>>and
> >>>>>Curtis (or someone else from ARM) closely examine Mohammad's
> >>patches
> >>>would
> >>>>>be a great start. I intend to review them both, though
> >>unfortunately
> >>my
> >>>>>time has been scarce lately---I'm hoping to squeeze that in later
> >>this
> >>>>>week.
> >>>>>
> >>>>>Once we've had a few people look at both, we can discuss the pros
> >>and
> >>>cons
> >>>>>of each, then discuss the strategy for getting the best features
> >>in.
> >>So
> >>>>>far I've heard that Mohammad's patches have a better network model
> >>but
> >>>the
> >>>>>ARM patches have better checkpointing support; that seems like a
> >>good
> >>>>>start.
> >>>>>
> >>>>>Steve
> >>>>>
> >>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> >>>***@arm.com
> >>>>>
> >>>>>wrote:
> >>>>>
> >>>>>>Hi all,
> >>>>>>
> >>>>>>Great work. However, I fundamentally do not believe in the
> >>approach
> >>of
> >>>>>>'letting reviewers pick the best features'. There is no way we
> >>would
> >>>>>>ever
> >>>>>>get something working out if it. We need to get _one_ working
> >>solution
> >>>>>>here, and figure out how to best get there. I would propose to
> >>do it
> >>>>>>bottom up, starting with the basic multi-simulator instance
> >>support,
> >>>>>>checkpointing support, and then move on to the network between
> >>the
> >>>>>>simulator instances.
> >>>>>>
> >>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
> >>>support
> >>>>>>from what Curtis has posted. I believe proper checkpointing
> >>support
> >>to
> >>>>>>be
> >>>>>>the most challenging, and from what I can tell this is far more
> >>>limited
> >>>>>>in
> >>>>>>what you just posted Mohammad. Could you perhaps review Curtis
> >>patches
> >>>>>>based on your insights, and we can try and get these patches in
> >>shape
> >>>>>>and
> >>>>>>committed asap.
> >>>>>>
> >>>>>>Once we have the baseline functionality in place, then we can
> >>start
> >>>>>>looking at the more elaborate network models.
> >>>>>>
> >>>>>>Does this sound reasonable?
> >>>>>>
> >>>>>>Thanks,
> >>>>>>
> >>>>>>Andreas
> >>>>>>
> >>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> >>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >>>>>>
> >>>>>>>Hello All,
> >>>>>>>
> >>>>>>>I have submitted a chain of patches which enables gem5 to
> >>simulate
> >>a
> >>>>>>>cluster on multiple physical hosts:
> >>>>>>>
> >>>>>>>http://reviews.gem5.org/r/2909/
> >>>>>>>http://reviews.gem5.org/r/2910/
> >>>>>>>http://reviews.gem5.org/r/2912/
> >>>>>>>http://reviews.gem5.org/r/2913/
> >>>>>>>http://reviews.gem5.org/r/2914/
> >><http://reviews.gem5.org/r/2914/>
> >>>>>>>
> >>>>>>>and a patch that contains run scripts for a simple experiment:
> >>>>>>>http://reviews.gem5.org/r/2915/
> >>>>>>>
> >>>>>>>We have run several benchmarks using this infrastructure,
> >>including
> >>>NAS
> >>>>>>>parallel benchmarks (MPI) and DCBench-hadoop
> >>>>>>>(http://prof.ict.ac.cn/DCBench/),
> >>>>>>>and would be happy to share scripts/diskimages.
> >>>>>>>
> >>>>>>>We call this *pd-gem5*. *pd-gem5 *functionality is more or less
> >>the
> >>>>>>same
> >>>>>>>as
> >>>>>>>Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
> >>*network
> >>>>>>model
> >>>>>>>is
> >>>>>>>more thorough; it also enables modeling different network
> >>topologies.
> >>>>>>>Having both set of changes together let reviewers to pick best
> >>>features
> >>>>>>>from both works.
> >>>>>>>
> >>>>>>>Thank you,
> >>>>>>>Mohammad Alian
Andreas Hansson
2015-07-02 15:34:37 UTC
Permalink
Hi all,

I think we need to up-level this a bit. From our perspective (and I
suspect in general):

1. Robustness is important. Having a design that _may_ break, however
unlikely, is simply not an option.

2. Performance and scaling are important. We can compare actual numbers
here, and I am fairly sure the two solutions are on par. Let’s quantify
that though.

3. Checkpointing must not rely on synchronicity. It is vital for several
workloads that we can checkpoint the various gem5 instances at different
Ticks (due to the way the workloads are constructed).

Andreas

On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
<gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:

>Thanks Gabor for the reply.
>
>I feel this conversation is useful as we can find out pros/cons of each
>design.
>Please find my response in-lined below.
>
>Thank you,
>Mohammad
>
>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com> wrote:
>
>> Hi All,
>>
>> Sorry for the missing indentation in my previous e-mail! (This was my
>> first e-mail to the dev-list so I could not simply use “reply"). Below
>>is
>> the same message, hopefully in more readable form.
>>
>> ====================================
>>
>> Hi All,
>>
>> Thank you Mohammad for your elaboration on the issues!
>>
>> I have written most of the multi-gem5 patch so let me add some more
>> clarifications and answer to your concerns. My comments are inline
>>below.
>>
>> Thanks,
>> - Gabor
>>
>> On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>>
>> >Hi All,
>> >
>> >Curtis, thank you for listing some of the differences. I was waiting for
>> >the completed multi-gem5 patch before sending my review. Please see my
>> >inline response below. I've addressed the concerns that you've raised.
>> >Also, I've added a bit more to the comparison.
>> >
>> >-* Synchronization.
>> >
>> >pd-gem5 implements this in Python (not a problem in itself;
>>aesthetically
>> >
>> >this is nice, but...). The issue is that pd-gem5's data packets and
>> >
>> >barrier messages travel over different sockets. Since pd-gem5 could
>>see
>> >
>> >data packets passing synchronization barriers, it could create an
>> >
>> >inconsistent checkpoint.
>> >
>> >multi-gem5's synchronization is implemented in C++ using sync events,
>>but
>> >
>> >more importantly, the messages queue up in the same stream and so
>>cannot
>> >
>> >have the issue just described. (Event ordering is often crucial in
>> >
>> >snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
>> >
>> >solution in this respect.
>> >
>> >Each packet in pd-gem5 has a time-stamp. So even if data packets pass
>> >synchronization barriers (in other words, data packets arrive early at
>> >the destination node), the destination node processes packets based on
>> >their timestamps. Actually, allowing data packets to pass sync barriers
>> >is a nice feature that can reduce the likelihood of late packet
>> >reception. The ordering of data messages that flow over pd-gem5 nodes is
>> >also preserved in the pd-gem5 implementation.
>>
>> This seems to be a misunderstanding. Maybe the wording was not precise
>> before. The problem is not a data packet "passing" a sync barrier
>> but the other way around, a sync barrier that can pass a data packet
>> (e.g. while the data packet is waiting in the host operating system
>> socket layer). If that happens, the packet will arrive later than it
>> was supposed to and it may miss the computed receive tick.
>>
>> For instance, let’s assume that the quantum coincides with the simulated
>> Ether link delay. (This is the optimal choice of quantum to minimize the
>> number of sync barriers.) If a data packet is sent right at the
>> beginning of a quantum then this packet must arrive at the destination
>> gem5 process within the same quantum in order not to miss its receive
>> tick at the very beginning of the next quantum. If the sync barrier can
>> pass the data packet then the data packet may arrive only during the
>> next quantum (or in extreme conditions even later than that) so when it
>> arrives the receiver gem5 may have already passed the receive tick.
>>
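The late-arrival condition described above can be sketched in a few lines of Python. This is only an illustration, not code from either patch: the function names and the tick values are hypothetical, and it assumes the receive tick is simply the send tick plus the simulated link delay.

```python
def receive_tick(send_tick: int, link_delay: int) -> int:
    """Tick at which the packet must be delivered in simulated time (assumed model)."""
    return send_tick + link_delay


def is_late(send_tick: int, link_delay: int, receiver_tick_on_arrival: int) -> bool:
    """True if the receiver has already simulated past the packet's receive tick."""
    return receiver_tick_on_arrival > receive_tick(send_tick, link_delay)


QUANTUM = 1000          # hypothetical quantum length in ticks
LINK_DELAY = QUANTUM    # quantum chosen equal to the simulated link delay

# Packet sent at the very start of a quantum and read from the socket while
# the receiver is still inside that same quantum: it can be scheduled in time.
on_time = is_late(0, LINK_DELAY, receiver_tick_on_arrival=600)

# The sync barrier overtook the packet, so the receiver is already in the
# next quantum when the packet is finally read: the receive tick was missed.
overtaken = is_late(0, LINK_DELAY, receiver_tick_on_arrival=1200)
```

With these made-up numbers the first packet is on time and the second one is late, which is the case the argument is about.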
>This argument makes more sense than the previous one. Note that gem5 is a
>cycle-accurate simulator and it runs orders of magnitude slower than real
>hardware. So it's almost impossible that the flight time of a packet
>through the real network turns out to be more than the simulation time of
>one quantum. We ran a set of experiments just for this purpose: with the
>quantum size equal to the etherlink delay, we never got any late-arrival
>violation (what you described) for the full NAS benchmark suite (please
>refer to the paper).
>
>multi-gem5 is optimized for a case that almost never happens, sacrificing
>speedup for no gain.
>
>
>> Time-stamping does help with this issue. Also, if a data packet is
>>waiting
>> in the host operating system socket layer when the simulation thread
>>exits
>> to python to complete the next sync barrier then the packet will not go
>> into the checkpoint that may follow that sync barrier.
>>
>That's a good point. The current pd-gem5 checkpointing mechanism might miss
>packets that were sent during the previous quantum and are waiting in the
>OS socket buffer. I should add some code inside the ethertap serialization
>function to drain the ethertap socket before writing the checkpoint. I
>will update the pd-gem5 patch accordingly.
>
>>
>> >What you mentioned as an advantage for multi-gem5 is actually a key
>> >disadvantage: buffering sync messages behind data packets can add up to
>> >the
>> >synchronization overhead and slow down simulation significantly.
>>
>> The purpose of sync messages is to make sure that the data packets
>>arrive
>> in time (in terms of simulated time) at the destination so they can be
>> scheduled for being received at the proper computed tick. Sync messages
>> also make sure that no data packets are in flight when a sync barrier
>> completes before we take a checkpoint. They definitely add overhead for
>> the simulation but they are necessary for the correctness of the
>> simulation.
>>
>> The receive thread in multi-gem5 reads out packets from the socket in
>> parallel with the simulation thread so packets normally will not be
>> "queueing up” before a sync barrier message. There is definitely room
>> for improvements in the current implementation for reducing the
>> synchronization overhead but that is likely true for pd-gem5, too.
>> The important thing here is that the solution must provide correctness
>> (robustness) first.
>>
>pd-gem5 provides correctness. Please read my previous comment. The whole
>purpose of multi/pd-gem5 is to parallelize simulation with minimal
>overhead and gain speedup. If you fail to do so, nobody will use your tool.
>
>
>> >Also,
>> >multi-gem5 sends huge messages (multiHeaderPkt) through the network to
>> >perform each synchronization point, which increases synchronization
>> >overhead further. In pd-gem5, we choose to send just one character as
>> >the sync message through a separate socket to reduce synchronization
>> >overhead.
>>
>> The TCP/IP message size is unlikely to be the bottleneck here. Multi-gem5
>> will send ~50 bytes more in a sync barrier message than pd-gem5 but that
>> bigger sync message still fits into a single ethernet frame on the wire.
>> The end-to-end latency overhead that is caused by 50 bytes of extra
>> payload for a small single-frame TCP/IP message is likely to fall into
>> the "noise" category if one tries to measure it in a real cluster.
>>
>You should prove your hypothesis experimentally. Each gem5 process
>sends/receives sync messages at the end of every quantum. Say you are
>simulating an "N"-node computer cluster with "M" different configurations.
>Then you will have N*M gem5 processes that send/receive these 50 Bytes (I
>think it's more) of extra data at the same time over the network ...
>
>Furthermore, multi-gem5 sends a header before each data message. By
>comparison, pd-gem5 just adds 12 Bytes (each time-stamp is the 12 least
>significant digits of the Tick) to each data packet. I don't know exactly
>how large these "MultiHeaderPkt" are, but its two Tick fields alone are
>64 Bytes each! Also, header packets are separate TCP packets, so you
>pay for sending two separate packets for each data packet. And worst, you
>serialize all of these with sync messages.
>
>
>> >
>> >* Packet handling.
>> >
>> >pd-gem5 uses EtherTap for data packets but changed the polling
>>mechanism
>> >
>> >to go through the main event queue. Since this rate is actually linked
>> >
>> >with simulator progress, it cannot guarantee that the packets are
>> >serviced
>> >
>> >at regular intervals of real time. This can lead to packets queueing
>>up
>> >
>> >which would contribute to the synchronization issues mentioned above.
>> >
>> >multi-gem5 uses plain sockets with separate receive threads and so does
>> >not
>> >
>> >have this issue.
>> >
>> >I think you are again pointing to your first concern, which I've
>> >explained above. Packets that have queued up in the EtherTap socket will
>> >be processed and delivered to the simulation environment at the
>> >beginning of the next simulation quantum.
>> >
>> >Please notice that multi-gem5 introduces a new simObject to interface
>> >the simulation environment to the real world, which is redundant. This
>> >functionality is already provided by EtherTap.
>>
>> Except that the EtherTap solution does not provide a correct (robust)
>> solution for the synchronization problem.
>>
>Please read my first/second comments.
>
>
>> >
>> >* Checkpoint accuracy.
>> >
>> >A user would like to have a checkpoint at precisely the time the
>> >
>> >'m5 checkpoint' operation is executed so as to not miss any of the
>> >
>> >area of interest in his application.
>> >
>> >pd-gem5 requires that simulation finish the current quantum
>> >
>> >before checkpointing, so it cannot provide this.
>> >
>> >(Shortening the quantum can help, but usually the snapshot is being
>>taken
>> >
>> >while 'fast-forwarding', i.e. simulating as fast as possible, which
>>would
>> >
>> >motivate a longer quantum.)
>> >
>> >multi-gem5 can enter the drain cycle immediately upon receiving a
>> >
>> >checkpoint request. We find this accuracy highly desirable.
>> >
>> >It's true that if you have a large quantum size then there would be
>> >some discrepancy between the m5_ckpt instruction tick and the actual
>> >dump tick. Based on the multi-gem5 code, my understanding is that you
>> >send an async checkpoint message as soon as one of the gem5 processes
>> >encounters an m5_ckpt instruction. But I'm not sure how you fix the
>> >aforementioned issue, because you have to sync all gem5 processes before
>> >you start dumping the checkpoint, which necessitates a global
>> >synchronization beforehand.
>>
>> In multi-gem5, the gem5 process that encounters the m5_ckpt instruction
>> sends out an async checkpoint notification to the peer gem5 processes
>> and then starts draining immediately (at the same tick). So the
>> checkpoint will be taken at the exact tick from the initiator process's
>> point of view. The global synchronisation with the peer processes takes
>> place while the initiator process is still waiting at the same tick
>> (i.e. the simulation thread is suspended). However, the receiver thread
>> continues reading out the socket while waiting for the global sync to
>> complete, to make sure that in-flight data packets from peer gem5
>> processes are stored properly and saved into the checkpoint.
>>
>>
>So you mean multi-gem5 ends up with gem5 processes at different ticks
>after a checkpoint? In pd-gem5 we make sure that all gem5 processes
>start dumping the checkpoint at the same tick. Are you sure that it is
>correct to have each gem5 process dump its checkpoint at a different tick?
>
>I don't think this is a correct checkpointing design. However, if you feel
>it is correct, I can change a couple of lines in "Simulation.py" and the
>barrier scripts to implement the same functionality in pd-gem5. One thing
>that you are obsessed about is making sure that there are no in-flight
>packets when we start dumping a checkpoint, and you have all these complex
>mechanisms in place to ensure that! I think you can be 99.99999% sure that
>there is no in-flight packet by waiting for 1 second after all gem5
>processes have finished their quantum simulation and then dumping the
>checkpoint. Do you really think that delivering a TCP packet would take
>more than 1 second in today's systems!?
>Always go for simple solutions ...
>
>
>
>> >
>> >By the way, we have a fix for this issue by introducing a new m5 pseudo
>> >instruction.
>>
>> I fail to see how a new pseudo instruction can solve the problem of
>> completing the full quantum in pd-gem5 before a checkpoint can be taken.
>> Could you please elaborate on that?
>>
>As we take checkpoints while fast-forwarding and it is likely that we
>relax synchronization for speedup purposes, a new pseudo instruction that
>can set the quantum size (m5_qset) can be helpful. So, one can insert
>m5_qset in the benchmark source code before entering the ROI that contains
>m5_ckpt to decrease the quantum size beforehand and reduce the discrepancy
>between the m5_ckpt tick and the actual checkpoint tick. This is not
>included in the pd-gem5 patch right now.
>
>
>> >
>> >* Implementation of network topology.
>> >
>> >pd-gem5 uses a separate gem5 process to act as a switch whereas
>>multi-gem5
>> >
>> >uses a standalone packet relay process.
>> >
>> >We haven't measured the overhead of pd-gem5's simulated switch yet, but
>> >
>> >we're confident that our approach is at least as fast and more
>>scalable.
>> >
>> >pd-gem5 has the flexibility to simulate a switch box alongside one of
>> >the other gem5 processes. However, it might make that gem5 process the
>> >simulation bottleneck. One of the advantages of pd-gem5 over multi-gem5
>> >is that we use gem5 to simulate a switch box, which allows us to model
>> >any network topology by instantiating several Switch simObjects and
>> >interconnecting them with EtherLink in an arbitrary fashion. A
>> >standalone tcp server can only provide switch functionality (forwarding
>> >packets to destinations) and model a star network topology.
>> >Furthermore, it cannot model various network timings such as queueing
>> >delay, congestion, and routing latency. Also it has some accuracy issues
>> >that I will point out next.
>>
>> I agree with the complex topology argument. We already mentioned that
>> before as an advantage for pd-gem5 from the point of view of future
>> extensions. However, I do not agree that multi-gem5 cannot model
>> queueing delays and congestion. For a simple crossbar switch, it can
>> model queueing delays and congestion, but the receive queues are
>> distributed among the gem5 processes.
>>
>It's true that you can model the queuing delay of a simple crossbar by
>distributing queues across gem5 processes (end points). But to be able to
>do so you have to ensure the ordering of packets that you enqueue in the
>distributed queues. It is almost impossible without a synchronized switch
>box. You should have a reorder queue that reorders packets dynamically and
>updates the timing parameters for each packet as well. I don't know how
>much progress you have made on the ordering scheme in multi-gem5, but you
>may have already realized how complex and error-prone it can be. This
>argument is also related to my next argument, "Broken network timing".
>
>
>> >
>> >* Broken network timing:
>> >
>> >Forwarding packets between gem5 processes using a standalone tcp server
>> >can cause reordering between packets that have different sources but the
>> >same destination. It causes inaccurate network timing and, worst of all,
>> >non-deterministic simulation. pd-gem5 resolves this by reordering
>> >packets at the Switch process and then sending them to their destination
>> >(it's possible as the switch is synchronized with the rest of the nodes).
>>
>> In multi-gem5, there is always a HeaderPkt that contains some meta
>> information for each data packet. The meta information includes the send
>> tick and the sender rank (i.e. a unique ID of the sender gem5 process).
>> We use that information to define a well-defined ordering of packets
>> even if packets arrive at the same receiver from different senders.
>> This packet ordering scheme is still being tested so the corresponding
>> patch is not on the RB yet.
>>
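One way to picture an ordering keyed on send tick and sender rank is the short Python sketch below. It is a hypothetical illustration only: the `Header` type, its field names, and the choice of a plain sort are assumptions, not the scheme in the unpublished multi-gem5 patch.

```python
from typing import List, NamedTuple


class Header(NamedTuple):
    """Hypothetical per-packet meta information (names are assumptions)."""
    send_tick: int
    sender_rank: int
    payload: bytes


def deterministic_order(packets: List[Header]) -> List[Header]:
    """Order packets by (send tick, sender rank), independent of arrival order."""
    return sorted(packets, key=lambda p: (p.send_tick, p.sender_rank))


# The same three packets reaching one receiver in two different host-level
# arrival orders (e.g. because of TCP scheduling on the physical network).
a = Header(send_tick=500, sender_rank=1, payload=b"a")
b = Header(send_tick=500, sender_rank=0, payload=b"b")
c = Header(send_tick=400, sender_rank=2, payload=b"c")

run_1 = deterministic_order([a, b, c])
run_2 = deterministic_order([c, a, b])
```

Both runs yield the sequence c, b, a: the earliest send tick goes first and the tie at tick 500 is broken by the lower rank, so the result does not depend on which packet the host network delivered first.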
>Please read my previous comment. The most important part of the
>multi/pd-gem5 extension is ensuring accurate and deterministic simulation.
>
>
>> >
>> >* Amount of changes
>> >
>> >pd-gem5 introduces different modes in etherlink just to provide accurate
>> >timing for each component in the network subsystem (NIC, link, switch)
>> >as well as the capability of modeling different network topologies
>> >(mesh, ring, fat tree, etc). To enable a simple functionality, like what
>> >multi-gem5 provides, the amount of changes in gem5 can be limited to
>> >time-stamping packets and providing synchronization through python
>> >scripts. However, multi-gem5 re-implements functionalities that are
>> >already in gem5.
>>
>> This argument holds only if both implementations are correct (robust).
>>It
>> still seems to me that pd-gem5 does not provide correctness for the
>> synchronization/checkpointing parts.
>>
>Again, please read my first comment for correctness of pd-gem5.
>
>
>> >
>> >* Integrating with gem5 mainstream:
>> >
>> >The pd-gem5 launch script is written in python, which is suited for
>> >integration with gem5 python scripts. However, multi-gem5 uses a bash
>> >script. Also, all source files in pd-gem5 are already part of gem5
>> >mainstream. However, multi-gem5 has tcp_server.cc/hh, which is a
>> >standalone process and cannot be part of gem5.
>>
>> The multi-gem5 launch script is simple enough to rely only on the
>> shell. It can obviously be easily re-written in python if that added any
>> value. The tcp_server component is only a utility (like the "m5" utility
>> that is also part of gem5).
>>
>The thing is that it's more likely that users want to add some
>functionality to the run-script of multi/pd-gem5. E.g. the pd-gem5
>run-script supports launching simulations using a simulation pool
>management software (http://research.cs.wisc.edu/htcondor/). Using python
>enables users to easily add this kind of support.
>
>
>>
>> Cheers,
>> - Gabor
>>
>>
>> >On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <***@arm.com>
>> >wrote:
>> >
>> >>Hello everyone,
>> >>We have taken a look at how pd-gem5 compares with multi-gem5. While
>> >>intending
>> >>to deliver the same functionality, there are some crucial differences:
>> >>
>> >>* Synchronization.
>> >>
>> >> pd-gem5 implements this in Python (not a problem in itself;
>> >>aesthetically
>> >> this is nice, but...). The issue is that pd-gem5's data packets
>>and
>> >> barrier messages travel over different sockets. Since pd-gem5
>>could
>> >>see
>> >> data packets passing synchronization barriers, it could create an
>> >> inconsistent checkpoint.
>> >>
>> >> multi-gem5's synchronization is implemented in C++ using sync
>>events,
>> >>but
>> >> more importantly, the messages queue up in the same stream and so
>> >>cannot
>> >> have the issue just described. (Event ordering is often crucial
>>in
>> >> snapshot protocols.) Therefore we feel that multi-gem5 is a more
>> >>robust
>> >> solution in this respect.
>> >>
>> >>* Packet handling.
>> >>
>> >> pd-gem5 uses EtherTap for data packets but changed the polling
>> >>mechanism
>> >> to go through the main event queue. Since this rate is actually
>> >>linked
>> >> with simulator progress, it cannot guarantee that the packets are
>> >>serviced
>> >> at regular intervals of real time. This can lead to packets
>> >>queueing up
>> >> which would contribute to the synchronization issues mentioned
>>above.
>> >>
>> >> multi-gem5 uses plain sockets with separate receive threads and so
>> >>does
>> >>not
>> >> have this issue.
>> >>
>> >>* Checkpoint accuracy.
>> >>
>> >> A user would like to have a checkpoint at precisely the time the
>> >> 'm5 checkpoint' operation is executed so as to not miss any of the
>> >> area of interest in his application.
>> >>
>> >> pd-gem5 requires that simulation finish the current quantum
>> >> before checkpointing, so it cannot provide this.
>> >>
>> >> (Shortening the quantum can help, but usually the snapshot is being
>> >>taken
>> >> while 'fast-forwarding', i.e. simulating as fast as possible, which
>> >>would
>> >> motivate a longer quantum.)
>> >>
>> >> multi-gem5 can enter the drain cycle immediately upon receiving a
>> >> checkpoint request. We find this accuracy highly desirable.
>> >>
>> >>* Implementation of network topology.
>> >>
>> >> pd-gem5 uses a separate gem5 process to act as a switch whereas
>> >>multi-gem5
>> >> uses a standalone packet relay process.
>> >>
>> >> We haven't measured the overhead of pd-gem5's simulated switch yet,
>> >>but
>> >> we're confident that our approach is at least as fast and more
>> >>scalable.
>> >>
>> >>
>> >>Thanks,
>> >>Curtis
>> >>________________________________________
>> >>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad
>>Alian [
>> >>***@wisc.edu]
>> >>Sent: Friday, June 26, 2015 7:37 PM
>> >>To: gem5 Developer List
>> >>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>> >>system
>> >>on multiple physical hosts
>> >>
>> >>Hi Anthony,
>> >>
>> >>I think that would be a good option, then I can add pd-gem5
>> >>functionality
>> >>on top of that. Right now I've simplified your implementation. Also, I
>> >>think I had found some bugs in your patch that I cannot remember now.
>>If
>> >>you decided to ship EtherSwitch patch, let me know to give you a
>>review
>> >>on
>> >>that.
>> >>
>> >>Thanks,
>> >>Mohammad
>> >>
>> >>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
>> >>***@amd.com> wrote:
>> >>
>> >>>Would it make sense for me to ship the EtherSwitch patch first, since
>> >>it
>> >>>has utility on its own, and then we can decide which of the
>> >>"multi-gem5"
>> >>>approaches is best, or if it's some combination of both?
>> >>>
>> >>>The only reason I never shipped it was because Steve raised an issue
>> >>that
>> >>>I didn't have a good alternative for, and didn't have the time to
>>look
>> >>into
>> >>>one at that time.
>> >>>________________________________________
>> >>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
>> >>Alian [
>> >>>***@wisc.edu]
>> >>>Sent: Wednesday, June 24, 2015 12:43 PM
>> >>>To: gem5 Developer List
>> >>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>> >>system
>> >>>on multiple physical hosts
>> >>>
>> >>>Hi Andreas,
>> >>>
>> >>>Thanks for the comment.
>> >>>I think the checkpointing support in both works is the same. Here is
>> >>how
>> >>>checkpointing support is implemented in pd-gem5:
>> >>>
>> >>>Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
>> >>>instruction, it will send a "recv-ckpt" signal to the
>> >>>"barrier" process. Then the "barrier" process sends a "take-ckpt"
>> >>>signal to all the simulated nodes
>> >>>(including the node that encountered m5-checkpoint) at the end of the
>> >>>current simulation quantum. On reception of the
>> >>>"take-ckpt" signal, gem5 processes start dumping checkpoints. This
>> >>>makes each simulated node dump a checkpoint
>> >>>at the same simulated time point while ensuring there are no in-flight
>> >>>packets.
>> >>>
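The barrier-side handling of the two signals described above could look roughly like this Python sketch. It is an assumption-laden illustration rather than the pd-gem5 barrier script: the `Barrier` class, its methods, and the "continue" reply are hypothetical names, and real socket handling is left out.

```python
from typing import List


class Barrier:
    """Toy model of a barrier process that defers checkpoints to quantum end."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.ckpt_requested = False

    def on_message(self, msg: str) -> None:
        # A node hit the m5-checkpoint pseudo instruction during this quantum.
        if msg == "recv-ckpt":
            self.ckpt_requested = True

    def end_of_quantum(self) -> List[str]:
        """Return the message sent to every node once all reached the barrier."""
        if self.ckpt_requested:
            self.ckpt_requested = False
            # Every node, including the requester, dumps at the same tick.
            return ["take-ckpt"] * self.num_nodes
        return ["continue"] * self.num_nodes  # hypothetical "no checkpoint" reply


barrier = Barrier(num_nodes=3)
first = barrier.end_of_quantum()      # no request seen yet
barrier.on_message("recv-ckpt")       # a node asks for a checkpoint mid-quantum
second = barrier.end_of_quantum()     # request is honoured at the quantum boundary
third = barrier.end_of_quantum()      # request flag was cleared afterwards
```

The point of the sketch is the deferral: the request is recorded when it arrives, but nothing is broadcast until the quantum boundary, which is why the checkpoint tick can differ from the tick of the pseudo instruction.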
>> >>>I believe this is the same as multi-gem5 patch approach for
>>checkpoint
>> >>>support (based on the commit message of
>> >>http://reviews.gem5.org/r/2865/
>> >>).
>> >>>Also, we have tested our mechanism with several benchmarks and it
>> >>works.
>> >>As
>> >>>Steve suggested, I'll look into Curtis's patch and try to review it
>>as
>> >>>well.
>> >>>But as Nilay also mentioned earlier, there are some codes missing in
>> >>>Curtis's patch. I prefer to first run multi-gem5 before starting to
>> >>review
>> >>>it.
>> >>>
>> >>>Thank you,
>> >>>Mohammad
>> >>>
>> >>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
>> >>***@arm.com>
>> >>>wrote:
>> >>>
>> >>>>Hi Steve,
>> >>>>
>> >>>>Apologies for the confusion. We are on the same page. My point is
>> >>that
>> >>we
>> >>>>cannot simply take a little bit of patch A and a little bit of
>> >>patch B.
>> >>>>This change involves a lot of code, and we need to approach this in
>> >>a
>> >>>>structured fashion. My proposal is to do it bottom up, and start by
>> >>>>getting the basic support in place. Since
>> >>>http://reviews.gem5.org/r/2826/
>> >>>>has already been on the review board for a few months, I am merely
>> >>>>suggesting that the it would be a good start to relate the newly
>> >>posted
>> >>>>patches to what is already there.
>> >>>>
>> >>>>Andreas
>> >>>>
>> >>>>
>> >>>>
>> >>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
>> >>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
>> >>>>
>> >>>>>Hi Andreas,
>> >>>>>
>> >>>>>I'm a little confused by your email---you say you're fundamentally
>> >>>opposed
>> >>>>>to looking at both patches and picking the best features, then you
>> >>point
>> >>>>>out that the patches Curtis posted have the feature of better
>> >>>>>checkpointing
>> >>>>>support so we should pick that :).
>> >>>>>
>> >>>>>Obviously we can't just pick patch A from Mohammad's set and patch
>> >>B
>> >>>from
>> >>>>>Curtis's set and expect them to work together, but I think that
>> >>having
>> >>>>>both
>> >>>>>sets of patches available and comparing and contrasting the two
>> >>>>>implementations should enable us to get to a single implementation
>> >>>that's
>> >>>>>the best of both. Someone will have to make the effort of
>> >>integrating
>> >>>the
>> >>>>>better ideas from one set into the other set to create a new
>> >>unified
>> >>set
>> >>>>>of
>> >>>>>patches; (or maybe we commit one set and then integrate the best of
>> >>the
>> >>>>>other set as patches on top of that), but the first step is to
>> >>identify
>> >>>>>what "the best of both" is. Having Mohammad look at Curtis's
>> >>patches,
>> >>>and
>> >>>>>Curtis (or someone else from ARM) closely examine Mohammad's
>> >>patches
>> >>>would
>> >>>>>be a great start. I intend to review them both, though
>> >>unfortunately
>> >>my
>> >>>>>time has been scarce lately---I'm hoping to squeeze that in later
>> >>this
>> >>>>>week.
>> >>>>>
>> >>>>>Once we've had a few people look at both, we can discuss the pros
>> >>and
>> >>>cons
>> >>>>>of each, then discuss the strategy for getting the best features
>> >>in.
>> >>So
>> >>>>>far I've heard that Mohammad's patches have a better network model
>> >>but
>> >>>the
>> >>>>>ARM patches have better checkpointing support; that seems like a
>> >>good
>> >>>>>start.
>> >>>>>
>> >>>>>Steve
>> >>>>>
>> >>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
>> >>>***@arm.com
>> >>>>>
>> >>>>>wrote:
>> >>>>>
>> >>>>>>Hi all,
>> >>>>>>
>> >>>>>>Great work. However, I fundamentally do not believe in the
>> >>>>>>approach of 'letting reviewers pick the best features'. There is
>> >>>>>>no way we would ever get something working out of it. We need to
>> >>>>>>get _one_ working solution
>> >>>>>>here, and figure out how to best get there. I would propose to
>> >>do it
>> >>>>>>bottom up, starting with the basic multi-simulator instance
>> >>support,
>> >>>>>>checkpointing support, and then move on to the network between
>> >>the
>> >>>>>>simulator instances.
>> >>>>>>
>> >>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
>> >>>support
>> >>>>>>from what Curtis has posted. I believe proper checkpointing
>> >>support
>> >>to
>> >>>>>>be
>> >>>>>>the most challenging, and from what I can tell this is far more
>> >>>limited
>> >>>>>>in
>> >>>>>>what you just posted Mohammad. Could you perhaps review Curtis
>> >>patches
>> >>>>>>based on your insights, and we can try and get these patches in
>> >>shape
>> >>>>>>and
>> >>>>>>committed asap.
>> >>>>>>
>> >>>>>>Once we have the baseline functionality in place, then we can
>> >>start
>> >>>>>>looking at the more elaborate network models.
>> >>>>>>
>> >>>>>>Does this sound reasonable?
>> >>>>>>
>> >>>>>>Thanks,
>> >>>>>>
>> >>>>>>Andreas
>> >>>>>>
>> >>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
>> >>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
>> >>>>>>
>> >>>>>>>Hello All,
>> >>>>>>>
>> >>>>>>>I have submitted a chain of patches which enables gem5 to
>> >>simulate
>> >>a
>> >>>>>>>cluster on multiple physical hosts:
>> >>>>>>>
>> >>>>>>>http://reviews.gem5.org/r/2909/
>> >>>>>>>http://reviews.gem5.org/r/2910/
>> >>>>>>>http://reviews.gem5.org/r/2912/
>> >>>>>>>http://reviews.gem5.org/r/2913/
>> >>>>>>>http://reviews.gem5.org/r/2914/
>> >><http://reviews.gem5.org/r/2914/>
>> >>>>>>>
>> >>>>>>>and a patch that contains run scripts for a simple experiment:
>> >>>>>>>http://reviews.gem5.org/r/2915/
>> >>>>>>>
>> >>>>>>>We have run several benchmarks using this infrastructure,
>> >>including
>> >>>NAS
>> >>>>>>>parallel benchmarks (MPI) and DCBench-hadoop
>> >>>>>>>(http://prof.ict.ac.cn/DCBench/),
>> >>>>>>>and would be happy to share scripts/diskimages.
>> >>>>>>>
>> >>>>>>>We call this *pd-gem5*. *pd-gem5 *functionality is more or less
>> >>the
>> >>>>>>same
>> >>>>>>>as
>> >>>>>>>Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
>> >>*network
>> >>>>>>model
>> >>>>>>>is
>> >>>>>>>more thorough; it also enables modeling different network
>> >>topologies.
>> >>>>>>>Having both set of changes together let reviewers to pick best
>> >>>features
>> >>>>>>>from both works.
>> >>>>>>>
>> >>>>>>>Thank you,
>> >>>>>>>Mohammad Alian
>> >>>>>>>_______________________________________________
>> >>>>>>>gem5-dev mailing list
>> >>>>>>>gem5-***@gem5.org
>> >>>>>>>http://m5sim.org/mailman/listinfo/gem5-dev
>> >>>>>>
>> >>>>>>
>> >>>>>>-- IMPORTANT NOTICE: The contents of this email and any
>> >>attachments
>> >>>are
>> >>>>>>confidential and may also be privileged. If you are not the
>> >>intended
>> >>>>>>recipient, please notify the sender immediately and do not
>> >>disclose
>> >>>the
>> >>>>>>contents to any other person, use it for any purpose, or store or
>> >>copy
>> >>>>>>the
>> >>>>>>information in any medium. Thank you.
>> >>>>>>
>> >>>>>>ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1
>> >>9NJ,


Gutierrez, Anthony
2015-07-02 16:40:06 UTC
Permalink
I agree, and I think points 1 and 3 are also non-negotiable. Given that, I think the multi-gem5 design is more robust and fits in with the overall gem5 design philosophy. I've been slowly going over the code and see no major problems - certainly nothing to warrant keeping it out of the code base.

I was planning on giving it a ship it today, so I'll do that now.

-----Original Message-----
From: gem5-dev [mailto:gem5-dev-***@gem5.org] On Behalf Of Andreas Hansson
Sent: Thursday, July 02, 2015 8:35 AM
To: gem5 Developer List
Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed system on multiple physical hosts

Hi all,

I think we need to up-level this a bit. From our perspective (and I suspect in general):

1. Robustness is important. Having a design that _may_ break, however unlikely, is simply not an option.

2. Performance and scaling are important. We can compare actual numbers here, and I am fairly sure the two solutions are on par. Let’s quantify that though.

3. Checkpointing must not rely on synchronicity. It is vital for several workloads that we can checkpoint the various gem5 instances at different Ticks (due to the way the workloads are constructed).

Andreas

On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
<gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:

>Thanks Gabor for the reply.
>
>I feel this conversation is useful as we can find out pros/cons of each
>design.
>Please find my response in-lined below.
>
>Thank you,
>Mohammad
>
>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com> wrote:
>
>> Hi All,
>>
>> Sorry for the missing indentation in my previous e-mail! (This was my
>>first e-mail to the dev-list so I could not simply use “reply"). Below
>>is the same message, hopefully in more readable form.
>>
>> ====================================
>>
>> Hi All,
>>
>> Thank you Mohammad for your elaboration on the issues!
>>
>> I have written most of the multi-gem5 patch so let me add some more
>>clarifications and answer to your concerns. My comments are inline
>>below.
>>
>> Thanks,
>> - Gabor
>>
>> On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>>
>> >Hi All,
>> >
>> >Curtis-Thank you for listing some of the differences. I was waiting
>> >for the completed multi-gem5 patch before I send my review. Please
>> >see my
>>inline
>> >response below. I've addressed the concerns that you've raised.
>> >Also, I've added a bit more to the comparison.
>> >
>> >-* Synchronization.
>> >
>> >pd-gem5 implements this in Python (not a problem in itself;
>>aesthetically
>> >
>> >this is nice, but...). The issue is that pd-gem5's data packets and
>> >
>> >barrier messages travel over different sockets. Since pd-gem5 could
>>see
>> >
>> >data packets passing synchronization barriers, it could create an
>> >
>> >inconsistent checkpoint.
>> >
>> >multi-gem5's synchronization is implemented in C++ using sync
>> >events,
>>but
>> >
>> >more importantly, the messages queue up in the same stream and so
>>cannot
>> >
>> >have the issue just described. (Event ordering is often crucial in
>> >
>> >snapshot protocols.) Therefore we feel that multi-gem5 is a more
>> >robust
>> >
>> >solution in this respect.
>> >
>> >Each packet in pd-gem5 has a time-stamp. So even if data packets
>> >pass synchronization barriers (in other words, data packets arrive
>> >early at the destination node), the destination node processes packets
>> >based on their timestamp. Actually, allowing data packets to pass sync
>> >barriers is a nice feature that can reduce the likelihood of late
>> >packet reception. Ordering of data messages that flow over pd-gem5
>> >nodes is also preserved in the pd-gem5 implementation.
>>
>> This seems to be a misunderstanding. Maybe the wording was not
>>precise before. The problem is not a data packet "passing" a sync
>>barrier but the other way around: a sync barrier that can pass a data
>>packet (e.g. while the data packet is waiting in the host operating
>>system socket layer). If that happens, the packet will arrive later
>>than it was supposed to and it may miss the computed receive tick.
>>
>> For instance, let’s assume that the quantum coincides with the
>>simulated Ether link delay. (This is the optimal choice of quantum to
>>minimize the number of sync barriers.) If a data packet is sent
>>right at the beginning of a quantum then this packet must arrive at
>>the destination gem5 process within the same quantum in order not to
>>miss its receive tick at the very beginning of the next quantum. If
>>the sync barrier can pass the data packet then the data packet may
>>arrive only during the next quantum (or in extreme conditions even
>>later than that), so when it arrives the receiver gem5 may already
>>have passed the receive tick.
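A minimal sketch of the timing argument above, in Python. The function names, the tick values, and the model in which a packet's receive tick is simply its send tick plus the link delay are illustrative assumptions, not code from either patch set.

```python
def receive_tick(send_tick, link_delay):
    """Simulated tick at which the packet is due (assumed simple model)."""
    return send_tick + link_delay


def misses_receive_tick(send_tick, link_delay, receiver_tick_on_arrival):
    """True if the receiver has already simulated past the receive tick
    by the time the packet is read from the host socket."""
    return receiver_tick_on_arrival > receive_tick(send_tick, link_delay)


QUANTUM = 1000      # hypothetical quantum, in ticks
LINK_DELAY = 1000   # quantum chosen equal to the simulated link delay

# Packet sent at the very start of quantum 0; it is due at tick 1000.
send = 0

# Read from the socket while the receiver is still inside quantum 0: on time.
on_time = misses_receive_tick(send, LINK_DELAY, receiver_tick_on_arrival=999)

# The barrier overtook the packet, so it is only read during quantum 1.
late = misses_receive_tick(send, LINK_DELAY, receiver_tick_on_arrival=1500)

print(on_time, late)
```

With the quantum equal to the link delay there is no slack: any packet that is read after the barrier for its send quantum has completed can be late.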
>>
>This argument makes more sense than the previous one. Note that gem5 is
>a cycle-accurate simulator and it runs orders of magnitude slower than
>real hardware. So it's almost impossible for the flight time of a packet
>through the real network to exceed the simulation time of one quantum.
>We ran a set of experiments just for this purpose: with the quantum
>size equal to the etherlink delay, we never got any late-arrival
>violation (what you described) for the full NAS benchmark suite (please
>refer to the paper).
>
>multi-gem5 is optimized for a case that almost never happens, and is
>sacrificing speedup for no gain.
>
>
>> Time-stamping does help with this issue. Also, if a data packet is
>>waiting in the host operating system socket layer when the simulation
>>thread exits to python to complete the next sync barrier then the
>>packet will not go into the checkpoint that may follow that sync
>>barrier.
>>
>That's a good point. The current pd-gem5 checkpointing mechanism might
>miss packets that were sent during the previous quantum and are waiting
>in the OS socket buffer. I should add some code inside the ethertap
>serialization function to drain the ethertap socket before writing a
>checkpoint. I will update the pd-gem5 patch accordingly.
>
>>
>> >What you mentioned as an advantage for multi-gem5 is actually a key
>> >disadvantage: buffering sync messages behind data packets can add up
>> >to the synchronization overhead and slow down simulation
>> >significantly.
>>
>> The purpose of sync messages is to make sure that the data packets
>>arrive in time (in terms of simulated time) at the destination so
>>they can be scheduled for being received at the proper computed tick.
>>Sync messages also make sure that no data packets are in flight when
>>a sync barrier completes before we take a checkpoint. They
>>definitely add overhead for the simulation but they are necessary for
>>the correctness of the simulation.
>>
>> The receive thread in multi-gem5 reads out packets from the socket in
>> parallel with the simulation thread so packets normally will not be
>> "queueing up” before a sync barrier message. There is definitely
>> room for improvements in the current implementation for reducing the
>> synchronization overhead but that is likely true for pd-gem5, too.
>> The important thing here is that the solution must provide
>> correctness
>> (robustness) first.
>>
>pd-gem5 provides correctness. Please read my previous comment. The whole
>purpose of multi/pd-gem5 is to parallelize simulation with minimal
>overhead and gain speedup. If you fail to do so, nobody will use your
>tool.
>
>
>> >Also,
>> >multi-gem5 sends huge messages (multiHeaderPkt) through the network
>> >to perform each synchronization point, which increases
>> >synchronization overhead further. In pd-gem5, we choose to send just
>> >one character as the sync message through a separate socket to reduce
>> >synchronization overhead.
>>
>> The TCP/IP message size is unlikely to be the bottleneck here.
>>Multi-gem5 will send ~50 bytes more in a sync barrier message than
>>pd-gem5, but that bigger sync message still fits into a single
>>ethernet frame on the wire. The end-to-end latency overhead caused by
>>50 bytes of extra payload for a small single-frame TCP/IP message is
>>likely to fall into the "noise" category if one tries to measure it
>>in a real cluster.
>>
>You should prove your hypothesis experimentally. Each gem5 process
>sends/receives sync messages at the end of every quantum. Say you are
>simulating an "N"-node computer cluster with "M" different
>configurations. Then you will have N*M gem5 processes that send/receive
>these 50 Bytes (I think it's more) of extra data at the same time over
>the network ...
>
>Furthermore, multi-gem5 sends a header before each data message. In
>comparison, pd-gem5 just adds 12 Bytes (each time-stamp is the 12 least
>significant digits of the Tick) to each data packet. I don't know
>exactly how large these "MultiHeaderPkt" are, but each has two Tick
>fields that are each 64 Bytes! Also, header packets are separate TCP
>packets, so you pay for sending two separate packets for each data
>packet. And worst of all, you serialize all of these with sync messages.
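For illustration, a 12-byte time-stamp of the kind described above could be built as follows. This is only a sketch: the zero-padded ASCII decimal encoding and the placement of the stamp in front of the payload are assumptions, and `stamp_packet` / `unstamp_packet` are hypothetical names, not functions from the pd-gem5 patches.

```python
STAMP_DIGITS = 12  # "12 least significant digits of the Tick"


def stamp_packet(tick, payload):
    """Prefix payload with the 12 least significant decimal digits of tick."""
    stamp = "{:0{w}d}".format(tick % 10**STAMP_DIGITS, w=STAMP_DIGITS)
    return stamp.encode("ascii") + payload


def unstamp_packet(data):
    """Split a stamped packet back into (truncated tick, payload)."""
    tick = int(data[:STAMP_DIGITS].decode("ascii"))
    return tick, data[STAMP_DIGITS:]


pkt = stamp_packet(1234567890123456, b"frame")
print(len(pkt) - len(b"frame"))  # per-packet overhead in bytes
print(unstamp_packet(pkt))
```

Under this assumed encoding the stamp wraps every 10^12 ticks, so a receiver would have to reconstruct the upper digits from its own current tick.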
>
>
>> >
>> >* Packet handling.
>> >
>> >pd-gem5 uses EtherTap for data packets but changed the polling
>>mechanism
>> >
>> >to go through the main event queue. Since this rate is actually
>> >linked
>> >
>> >with simulator progress, it cannot guarantee that the packets are
>> >serviced
>> >
>> >at regular intervals of real time. This can lead to packets
>> >queueing
>>up
>> >
>> >which would contribute to the synchronization issues mentioned above.
>> >
>> >multi-gem5 uses plain sockets with separate receive threads and so
>> >does not
>> >
>> >have this issue.
>> >
>> >I think again you are pointing to your first concern that I've
>> >explained above. Packets that have queued up in the EtherTap socket
>> >will be processed and delivered to the simulation environment at the
>> >beginning of the next simulation quantum.
>> >
>> >Please notice that multi-gem5 introduces a new simObject to
>> >interface the simulation environment to the real world, which is
>> >redundant. This functionality is already provided by EtherTap.
>>
>> Except that the EtherTap solution does not provide a correct (robust)
>> solution for the synchronization problem.
>>
>Please read my first/second comments.
>
>
>> >
>> >* Checkpoint accuracy.
>> >
>> >A user would like to have a checkpoint at precisely the time the
>> >
>> >'m5 checkpoint' operation is executed so as to not miss any of the
>> >
>> >area of interest in his application.
>> >
>> >pd-gem5 requires that simulation finish the current quantum
>> >
>> >before checkpointing, so it cannot provide this.
>> >
>> >(Shortening the quantum can help, but usually the snapshot is being
>>taken
>> >
>> >while 'fast-forwarding', i.e. simulating as fast as possible, which
>>would
>> >
>> >motivate a longer quantum.)
>> >
>> >multi-gem5 can enter the drain cycle immediately upon receiving a
>> >
>> >checkpoint request. We find this accuracy highly desirable.
>> >
>> >It's true that if you have a large quantum size then there would be
>> >some discrepancy between the m5_ckpt instruction tick and the actual
>> >dump tick. Based on the multi-gem5 code, my understanding is that you
>> >send an async checkpoint message as soon as one of the gem5 processes
>> >encounters the m5_ckpt instruction. But I'm not sure how you fix the
>> >aforementioned issue, because you have to sync all gem5 processes
>> >before you start dumping the checkpoint, which necessitates a global
>> >synchronization beforehand.
>>
>> In multi-gem5, the gem5 process that encounters the m5_ckpt
>>instruction sends out an async checkpoint notification to the peer
>>gem5 processes and then it starts draining immediately (at the
>>same tick). So the checkpoint will be taken at the exact tick from
>>the initiator process's point of view. The global synchronisation with
>>the peer processes takes place while the initiator process is still
>>waiting at the same tick (i.e. the simulation thread is suspended).
>>However, the receiver thread continues reading out the socket while
>>waiting for the global sync to complete, to make sure that in-flight
>>data packets from peer gem5 processes are stored properly and saved
>>into the checkpoint.
>>
>>
>So you mean multi-gem5 ends up with gem5 processes at different ticks
>after a checkpoint? In pd-gem5 we make sure that all gem5 processes
>start dumping the checkpoint at the same tick. Are you sure that it is
>correct to have each gem5 process dump its checkpoint at a different
>tick?
>
>I don't think this is a correct checkpointing design. However, if you
>feel it is correct, I can change a couple of lines in "Simulation.py"
>and the barrier scripts to implement the same functionality in pd-gem5.
>One thing that you are obsessed about is making sure that there are no
>in-flight packets when we start dumping a checkpoint, and you have all
>these complex mechanisms in place to ensure that! I think you can be
>99.99999% sure that there is no in-flight packet by waiting for 1
>second after all gem5 processes have finished their quantum simulation
>and then dumping the checkpoint. Do you really think that delivering a
>TCP packet would take more than 1 second in today's systems!?
>Always go for simple solutions ...
>
>
>
>> >
>> >By the way, we have a fix for this issue by introducing a new m5
>> >pseudo instruction.
>>
>> I fail to see how a new pseudo instruction can solve the problem of
>> completing the full quantum in pd-gem5 before a checkpoint can be taken.
>> Could you please elaborate on that?
>>
>As we take checkpoints while fast-forwarding, and it is likely that we
>relax synchronization for speedup purposes, a new pseudo instruction
>that can set the quantum size (m5_qset) can be helpful. So, one can
>insert m5_qset in the benchmark source code before entering the ROI that
>contains m5_ckpt to decrease the quantum size beforehand and reduce the
>discrepancy between the m5_ckpt tick and the actual checkpoint tick.
>This is not included in the pd-gem5 patch right now.
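The effect of the quantum size on checkpoint accuracy can be sketched as below, assuming (as described elsewhere in this thread) that pd-gem5 dumps the checkpoint at the end of the quantum in which m5_ckpt is encountered. The function names and tick values are made up for the example.

```python
def checkpoint_tick(request_tick, quantum):
    """Tick of the first quantum boundary at or after the m5_ckpt request."""
    return -(-request_tick // quantum) * quantum  # ceiling division


def discrepancy(request_tick, quantum):
    """Ticks simulated between the m5_ckpt request and the actual dump."""
    return checkpoint_tick(request_tick, quantum) - request_tick


request = 1_234_567  # hypothetical tick of the m5_ckpt instruction
for quantum in (1_000_000, 10_000, 100):
    print(quantum, discrepancy(request, quantum))
```

The discrepancy is always smaller than the quantum, which is why shrinking the quantum shortly before the region of interest tightens the checkpoint position, at the cost of more frequent barriers.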
>
>
>> >
>> >* Implementation of network topology.
>> >
>> >pd-gem5 uses a separate gem5 process to act as a switch whereas
>>multi-gem5
>> >
>> >uses a standalone packet relay process.
>> >
>> >We haven't measured the overhead of pd-gem5's simulated switch yet,
>> >but
>> >
>> >we're confident that our approach is at least as fast and more
>>scalable.
>> >
>> >There is the flexibility in pd-gem5 to simulate a switch box
>> >alongside one of the other gem5 processes. However, it might make
>> >that gem5 process the simulation bottleneck. One of the advantages
>> >of pd-gem5 over multi-gem5 is that we use gem5 to simulate a switch
>> >box, which allows us to model any network topology by instantiating
>> >several Switch simObjects and interconnecting them with EtherLink in
>> >an arbitrary fashion. A standalone tcp server can only provide switch
>> >functionality (forwarding packets to destinations) and model a star
>> >network topology. Furthermore, it cannot model various network
>> >timings such as queueing delay, congestion, and routing latency.
>> >Also, it has some accuracy issues that I will point out next.
>>
>> I agree with the complex topology argument. We already mentioned that
>>before as an advantage for pd-gem5 from the point of view of future
>>extensions. However, I do not agree that multi-gem5 cannot model
>>queueing delays and congestion. For a simple crossbar switch, it can
>>model queueing delays and congestion, but the receive queues are
>>distributed among the gem5 processes.
>>
>It's true that you can model the queuing delay of a simple crossbar by
>distributing queues across gem5 processes (end points). But to be able
>to do so you have to ensure the ordering of packets that you enqueue in
>the distributed queues. It is almost impossible without a synchronized
>switch box. You should have a reorder queue that reorders packets
>dynamically and updates the timing parameters for each packet as well.
>I don't know how much progress you have made on the ordering scheme in
>multi-gem5, but you may already have realized how complex and
>error-prone it can be. This argument is also related to my next
>argument, "Broken network timing".
>
>
>> >
>> >* Broken network timing:
>> >
>> >Forwarding packets between gem5 processes using a standalone tcp
>> >server can cause reordering between packets that have different
>> >sources but the same destination. It causes inaccurate network
>> >timing and, worst of all, non-deterministic simulation. pd-gem5
>> >resolves this by reordering packets at the Switch process and then
>> >sending them to their destination (it's possible as the switch is
>> >synchronized with the rest of the nodes).
>>
>> In multi-gem5, there is always a HeaderPkt that contains some meta
>>information for each data packet. The meta information includes the
>>send tick and the sender rank (i.e. a unique ID of the sender gem5
>>process). We use that information to define a well-defined ordering of
>>packets even if packets are arriving at the same receiver from
>>different senders. This packet ordering scheme is still being tested
>>so the corresponding patch is not on the RB yet.
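One way to picture such an ordering is the sketch below. It is not the multi-gem5 code (which, per the message above, had not been posted yet); the `Header` tuple and `delivery_order` function are hypothetical, and the only idea taken from the thread is sorting on the pair (send tick, sender rank).

```python
from collections import namedtuple

# Hypothetical stand-in for the meta information carried in the header.
Header = namedtuple("Header", ["send_tick", "sender_rank", "payload"])


def delivery_order(headers):
    """Order packets by (send tick, sender rank), independent of the
    order in which they happened to arrive on the host sockets."""
    return sorted(headers, key=lambda h: (h.send_tick, h.sender_rank))


a = Header(send_tick=100, sender_rank=2, payload=b"from-2")
b = Header(send_tick=100, sender_rank=0, payload=b"from-0")
c = Header(send_tick=90, sender_rank=1, payload=b"from-1")

# Two different host-level arrival orders give the same simulated order.
same = delivery_order([a, b, c]) == delivery_order([c, a, b])
ranks = [h.sender_rank for h in delivery_order([a, b, c])]
print(same, ranks)
```

Because the sender rank is unique per gem5 process, ties on the send tick between different senders are always broken the same way, which is what makes the result repeatable across runs.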
>>
>Please read my previous comment. The most important part of the
>multi/pd-gem5 extension is ensuring accurate and deterministic
>simulation.
>
>
>> >
>> >* Amount of changes
>> >
>> >pd-gem5 introduces different modes in etherlink just to provide
>> >accurate timing for each component in the network subsystem (NIC,
>> >link, switch) as well as the capability of modeling different
>> >network topologies (mesh, ring, fat tree, etc). To enable a simple
>> >functionality, like what multi-gem5 provides, the amount of changes
>> >in gem5 can be limited to time-stamping packets and providing
>> >synchronization through python scripts. However, multi-gem5
>> >re-implements functionality that is already in gem5.
>>
>> This argument holds only if both implementations are correct (robust).
>>It
>> still seems to me that pd-gem5 does not provide correctness for the
>>synchronization/checkpointing parts.
>>
>Again, please read my first comment for correctness of pd-gem5.
>
>
>> >
>> >* Integrating with gem5 mainstream:
>> >
>> >pd-gem5 launch script is written in python which is suited for
>>integration
>> >with gem5 python scripts. However multi-gem5 uses bash script. Also,
>>all
>> >source files in pd-gem5 are already parts of gem5 mainstream.
>> >However
>> >multi-gem5 has tcp_server.cc/hh that is a standalone process and
>> >cannot
>> be
>> >part of gem5.
>>
>> The multi-gem5 launch script is simple enough to rely only on the
>>shell. It can obviously be easily re-written in python if that added
>>any value. The tcp_server component is only a utility (like the "m5"
>>utility that is also part of gem5).
>>
>The thing is that it's more likely that users want to add some
>functionality to the run-script of multi/pd-gem5. E.g. the pd-gem5
>run-script supports launching simulations using a simulation pool
>management software ( http://research.cs.wisc.edu/htcondor/). Using
>python enables users to easily add this kind of support.
>
>
>>
>> Cheers,
>> - Gabor
>>
>>
>> >On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham
>> ><***@arm.com>
>> >wrote:
>> >
>> >>Hello everyone,
>> >>We have taken a look at how pd-gem5 compares with multi-gem5.
>> >>While intending to deliver the same functionality, there are some
>> >>crucial differences:
>> >>
>> >>* Synchronization.
>> >>
>> >> pd-gem5 implements this in Python (not a problem in itself;
>> >>aesthetically
>> >> this is nice, but...). The issue is that pd-gem5's data
>> >>packets
>>and
>> >> barrier messages travel over different sockets. Since pd-gem5
>>could
>> >>see
>> >> data packets passing synchronization barriers, it could create an
>> >> inconsistent checkpoint.
>> >>
>> >> multi-gem5's synchronization is implemented in C++ using sync
>>events,
>> >>but
>> >> more importantly, the messages queue up in the same stream and
>> >>so cannot
>> >> have the issue just described. (Event ordering is often
>> >>crucial
>>in
>> >> snapshot protocols.) Therefore we feel that multi-gem5 is a
>> >>more robust
>> >> solution in this respect.
>> >>
>> >>* Packet handling.
>> >>
>> >> pd-gem5 uses EtherTap for data packets but changed the polling
>> >>mechanism
>> >> to go through the main event queue. Since this rate is
>> >>actually linked
>> >> with simulator progress, it cannot guarantee that the packets
>> >>are serviced
>> >> at regular intervals of real time. This can lead to packets
>> >>queueing up
>> >> which would contribute to the synchronization issues mentioned
>>above.
>> >>
>> >> multi-gem5 uses plain sockets with separate receive threads and
>> >>so does not
>> >> have this issue.
>> >>
>> >>* Checkpoint accuracy.
>> >>
>> >> A user would like to have a checkpoint at precisely the time the
>> >> 'm5 checkpoint' operation is executed so as to not miss any of the
>> >> area of interest in his application.
>> >>
>> >> pd-gem5 requires that simulation finish the current quantum
>> >> before checkpointing, so it cannot provide this.
>> >>
>> >> (Shortening the quantum can help, but usually the snapshot is
>> >>being taken
>> >> while 'fast-forwarding', i.e. simulating as fast as possible,
>> >>which would
>> >> motivate a longer quantum.)
>> >>
>> >> multi-gem5 can enter the drain cycle immediately upon receiving a
>> >> checkpoint request. We find this accuracy highly desirable.
>> >>
>> >>* Implementation of network topology.
>> >>
>> >> pd-gem5 uses a separate gem5 process to act as a switch whereas
>> >>multi-gem5
>> >> uses a standalone packet relay process.
>> >>
>> >> We haven't measured the overhead of pd-gem5's simulated switch
>> >>yet, but
>> >> we're confident that our approach is at least as fast and more
>> >>scalable.
>> >>
>> >>
>> >>Thanks,
>> >>Curtis
>> >>________________________________________
>> >>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad
>>Alian [
>> >>***@wisc.edu]
>> >>Sent: Friday, June 26, 2015 7:37 PM
>> >>To: gem5 Developer List
>> >>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>> >>system on multiple physical hosts
>> >>
>> >>Hi Anthony,
>> >>
>> >>I think that would be a good option, then I can add pd-gem5
>> >>functionality on top of that. Right now I've simplified your
>> >>implementation. Also, I think I had found some bugs in your patch
>> >>that I cannot remember now.
>>If
>> >>you decided to ship EtherSwitch patch, let me know to give you a
>>review
>> >>on
>> >>that.
>> >>
>> >>Thanks,
>> >>Mohammad
>> >>
>> >>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
>> >>***@amd.com> wrote:
>> >>
>> >>>Would it make sense for me to ship the EtherSwitch patch first,
>> >>>since
>> >>it
>> >>>has utility on its own, and then we can decide which of the
>> >>"multi-gem5"
>> >>>approaches is best, or if it's some combination of both?
>> >>>
>> >>>The only reason I never shipped it was because Steve raised an
>> >>>issue
>> >>that
>> >>>I didn't have a good alternative for, and didn't have the time to
>>look
>> >>into
>> >>>one at that time.
>> >>>________________________________________
>> >>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
>> >>Alian [
>> >>>***@wisc.edu]
>> >>>Sent: Wednesday, June 24, 2015 12:43 PM
>> >>>To: gem5 Developer List
>> >>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>> >>system
>> >>>on multiple physical hosts
>> >>>
>> >>>Hi Andreas,
>> >>>
>> >>>Thanks for the comment.
>> >>>I think the checkpointing support in both works is the same. Here
>> >>>is
>> >>how
>> >>>checkpointing support is implemented in pd-gem5:
>> >>>
>> >>>Whenever one of the gem5 processes encounters an m5-checkpoint
>> >>>pseudo instruction, it will send a "recv-ckpt" signal to the
>> >>>"barrier" process. Then the "barrier" process sends a "take-ckpt"
>> >>>signal to all the simulated nodes (including the node that
>> >>>encountered m5-checkpoint) at the end of the current simulation
>> >>>quantum. On reception of the "take-ckpt" signal, gem5 processes
>> >>>start dumping checkpoints. This makes each simulated node dump a
>> >>>checkpoint at the same simulated time point while ensuring there
>> >>>are no in-flight packets.
>> >>>
>> >>>I believe this is the same as multi-gem5 patch approach for
>>checkpoint
>> >>>support (based on the commit message of
>> >>http://reviews.gem5.org/r/2865/
>> >>).
>> >>>Also, we have tested our mechanism with several benchmarks and it
>> >>works.
>> >>As
>> >>>Steve suggested, I'll look into Curtis's patch and try to review
>> >>>it
>>as
>> >>>well.
>> >>>But as Nilay also mentioned earlier, there are some codes missing
>> >>>in Curtis's patch. I prefer to first run multi-gem5 before
>> >>>starting to
>> >>review
>> >>>it.
>> >>>
>> >>>Thank you,
>> >>>Mohammad
>> >>>
>> >>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
>> >>***@arm.com>
>> >>>wrote:
>> >>>
>> >>>>Hi Steve,
>> >>>>
>> >>>>Apologies for the confusion. We are on the same page. My point is
>> >>that
>> >>we
>> >>>>cannot simply take a little bit of patch A and a little bit of
>> >>patch B.
>> >>>>This change involves a lot of code, and we need to approach this
>> >>>>in
>> >>a
>> >>>>structured fashion. My proposal is to do it bottom up, and start
>> >>>>by getting the basic support in place. Since
>> >>>http://reviews.gem5.org/r/2826/
>> >>>>has already been on the review board for a few months, I am
>> >>>>merely suggesting that the it would be a good start to relate the
>> >>>>newly
>> >>posted
>> >>>>patches to what is already there.
>> >>>>
>> >>>>Andreas
>> >>>>
>> >>>>
>> >>>>
>> >>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
>> >>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
>> >>>>
>> >>>>>Hi Andreas,
>> >>>>>
>> >>>>>I'm a little confused by your email---you say you're
>> >>>>>fundamentally
>> >>>opposed
>> >>>>>to looking at both patches and picking the best features, then
>> >>>>>you
>> >>point
>> >>>>>out that the patches Curtis posted have the feature of better
>> >>>>>checkpointing support so we should pick that :).
>> >>>>>
>> >>>>>Obviously we can't just pick patch A from Mohammad's set and
>> >>>>>patch
>> >>B
>> >>>from
>> >>>>>Curtis's set and expect them to work together, but I think that
>> >>having
>> >>>>>both
>> >>>>>sets of patches available and comparing and contrasting the two
>> >>>>>implementations should enable us to get to a single
>> >>>>>implementation
>> >>>that's
>> >>>>>the best of both. Someone will have to make the effort of
>> >>integrating
>> >>>the
>> >>>>>better ideas from one set into the other set to create a new
>> >>unified
>> >>set
>> >>>>>of
>> >>>>>patches; (or maybe we commit one set and then integrate the best
>> >>>>>of
>> >>the
>> >>>>>other set as patches on top of that), but the first step is to
>> >>identify
>> >>>>>what "the best of both" is. Having Mohammad look at Curtis's
>> >>patches,
>> >>>and
>> >>>>>Curtis (or someone else from ARM) closely examine Mohammad's
>> >>patches
>> >>>would
>> >>>>>be a great start. I intend to review them both, though
>> >>unfortunately
>> >>my
>> >>>>>time has been scarce lately---I'm hoping to squeeze that in
>> >>>>>later
>> >>this
>> >>>>>week.
>> >>>>>
>> >>>>>Once we've had a few people look at both, we can discuss the
>> >>>>>pros
>> >>and
>> >>>cons
>> >>>>>of each, then discuss the strategy for getting the best features
>> >>in.
>> >>So
>> >>>>>far I've heard that Mohammad's patches have a better network
>> >>>>>model
>> >>but
>> >>>the
>> >>>>>ARM patches have better checkpointing support; that seems like a
>> >>good
>> >>>>>start.
>> >>>>>
>> >>>>>Steve
>> >>>>>
>> >>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
>> >>>***@arm.com
>> >>>>>
>> >>>>>wrote:
>> >>>>>
>> >>>>>>Hi all,
>> >>>>>>
>> >>>>>>Great work. However, I fundamentally do not believe in the
>> >>approach
>> >>of
>> >>>>>>'letting reviewers pick the best features'. There is no way we
>> >>>>>>would ever
>> >>>>>>get something working out of it. We need to get _one_ working
>> >>solution
>> >>>>>>here, and figure out how to best get there. I would propose to
>> >>do it
>> >>>>>>bottom up, starting with the basic multi-simulator instance
>> >>support,
>> >>>>>>checkpointing support, and then move on to the network between
>> >>the
>> >>>>>>simulator instances.
>> >>>>>>
>> >>>>>>Thus, I propose we go with the low-level plumbing and
>> >>>>>>checkpoint
>> >>>support
>> >>>>>>from what Curtis has posted. I believe proper checkpointing
>> >>support
>> >>to
>> >>>>>>be
>> >>>>>>the most challenging, and from what I can tell this is far more
>> >>>limited
>> >>>>>>in
>> >>>>>>what you just posted Mohammad. Could you perhaps review Curtis
>> >>patches
>> >>>>>>based on your insights, and we can try and get these patches in
>> >>shape
>> >>>>>>and
>> >>>>>>committed asap.
>> >>>>>>
>> >>>>>>Once we have the baseline functionality in place, then we can
>> >>start
>> >>>>>>looking at the more elaborate network models.
>> >>>>>>
>> >>>>>>Does this sound reasonable?
>> >>>>>>
>> >>>>>>Thanks,
>> >>>>>>
>> >>>>>>Andreas
>> >>>>>>
>> >>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
>> >>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
>> >>>>>>
>> >>>>>>>Hello All,
>> >>>>>>>
>> >>>>>>>I have submitted a chain of patches which enables gem5 to simulate a
>> >>>>>>>cluster on multiple physical hosts:
>> >>>>>>>
>> >>>>>>>http://reviews.gem5.org/r/2909/
>> >>>>>>>http://reviews.gem5.org/r/2910/
>> >>>>>>>http://reviews.gem5.org/r/2912/
>> >>>>>>>http://reviews.gem5.org/r/2913/
>> >>>>>>>http://reviews.gem5.org/r/2914/
>> >>>>>>>
>> >>>>>>>and a patch that contains run scripts for a simple experiment:
>> >>>>>>>http://reviews.gem5.org/r/2915/
>> >>>>>>>
>> >>>>>>>We have run several benchmarks using this infrastructure, including NAS
>> >>>>>>>parallel benchmarks (MPI) and DCBench-hadoop
>> >>>>>>>(http://prof.ict.ac.cn/DCBench/),
>> >>>>>>>and would be happy to share scripts/diskimages.
>> >>>>>>>
>> >>>>>>>We call this *pd-gem5*. *pd-gem5* functionality is more or less the same
>> >>>>>>>as Curtis's patch for *multi-gem5*. However, I feel the *pd-gem5* network
>> >>>>>>>model is more thorough; it also enables modeling different network
>> >>>>>>>topologies. Having both sets of changes together lets reviewers pick the
>> >>>>>>>best features from both works.
>> >>>>>>>
>> >>>>>>>Thank you,
>> >>>>>>>Mohammad Alian
>> >>>>>>>_______________________________________________
>> >>>>>>>gem5-dev mailing list
>> >>>>>>>gem5-***@gem5.org
>> >>>>>>>http://m5sim.org/mailman/listinfo/gem5-dev
>> >>>>>>
>> >>>>>>
>> >>>>>>-- IMPORTANT NOTICE: The contents of this email and any
>> >>attachments
>> >>>are
>> >>>>>>confidential and may also be privileged. If you are not the
>> >>intended
>> >>>>>>recipient, please notify the sender immediately and do not
>> >>disclose
>> >>>the
>> >>>>>>contents to any other person, use it for any purpose, or store
>> >>>>>>or
>> >>copy
>> >>>>>>the
>> >>>>>>information in any medium. Thank you.
>> >>>>>>
>> >>>>>>ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1
>> >>9NJ,
>> >>>>>>Registered in England & Wales, Company No: 2557590 ARM
>> >>>>>>Holdings plc, Registered office 110 Fulbourn Road, Cambridge
>> >>CB1
>> >>>>>>9NJ,
>> >>>>>>Registered in England & Wales, Company No: 2548782
Steve Reinhardt
2015-07-02 18:20:36 UTC
Permalink
Hi everyone,

Sorry for taking so long to engage. This is a great development and I think
both these patches are terrific contributions. Thanks to Mohammad, Gabor,
and everyone else involved.

I agree with Andreas that we should start with some top-level goals &
assumptions, agree on those, and then we can sort out the detailed issues
based on a consistent view.

I definitely agree with Andreas's first two points. The third one seems a
little surprising; I'd like to hear more about the motivation before
expressing an opinion. I can see where non-synchronous checkpointing could
be useful, but it's also clear from the associated patch that it's not
trivial to implement either. How much would be lost by requiring a
synchronization before a checkpoint?

From my personal perspective, I would like to see whatever we do here be a
first step toward a more general distributed simulation platform. Both of
these patches seem pretty Ethernet-centric in different ways. This is not
terrible; part of the problem is that gem5's current internal networking
support is already overly Ethernet-centric IMO. But it would be nice to
avoid baking that in even further. Rather than assume I have understood all
the code completely, I'll phrase things in the form of questions, and
people can comment on how those questions would be answered in the context
of the two different approaches.

1. How much effort would be required to simulate a non-Ethernet network? My
impression is that pd-gem5 has a leg up here, since a gem5 switch model for
a non-Ethernet network (which you'd have to write anyway if you were
simulating a different network) could be used in place of the current
Ethernet switch, where for multi-gem5 I think that the
util/multi/tcp_server.cc code would have to be modified (i.e., there'd be
additional work above and beyond what you'd need to get the network modeled
in base gem5).

2. How much effort is required to run on a non-Ethernet network (or
equivalently using a non-sockets API)? The MultiIface/TCPIface split in
the multi-gem5 code looks like it addresses this nicely, but pd-gem5 seems
pretty tied to an Ethernet host fabric.

3. Do both of these patches work with the existing multithreaded
multiple-event-queue simulation? I think multi-gem5 does (though it would
be nice to have a confirmation), but it's not clear about pd-gem5. I don't
see a benefit to having multiple gem5 processes on a single host vs. a
single multithreaded gem5 process using the existing support. I think this
could be particularly valuable with a hierarchical network; e.g., maybe I
would want to model a rack in multithreaded mode on a single multicore
server, then use pd-gem5 or multi-gem5 to build up a simulation of multiple
racks. Would this work out of the box with either of these patches, and if
not, what would need to be done?

4. Is it possible to construct a single-process simulation model that's
identical to the distributed simulation? It would be very valuable for
verification to be able to take a single simulation run and do it both
within a single process and also across multiple processes and verify that
identical results are achieved. This seems like a big drawback to the
multi-gem5 tcp_server approach, IMO.

I'm definitely not saying that all these issues need to be resolved before
anything gets committed, but if we can agree that these are valid goals,
then we can evaluate detailed issues based on whether they move us toward
or away from those goals.

Thanks,

Steve


On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson <***@arm.com>
wrote:

> Hi all,
>
> I think we need to up-level this a bit. From our perspective (and I
> suspect in general):
>
> 1. Robustness is important. Having a design that _may_ break, however
> unlikely, is simply not an option.
>
> 2. Performance and scaling is important. We can compare actual numbers
> here, and I am fairly sure the two solutions are on par. Let’s quantify
> that though.
>
> 3. Checkpointing must not rely on synchronicity. It is vital for several
> workloads that we can checkpoint the various gem5 instances at different
> Ticks (due to the way the workloads are constructed).
>
> Andreas
>
> On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
> <gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
>
> >Thanks Gabor for the reply.
> >
> >I feel this conversation is useful as we can find out pros/cons of each
> >design.
> >Please find my response in-lined below.
> >
> >Thank you,
> >Mohammad
> >
> >On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com> wrote:
> >
> >> Hi All,
> >>
> >> Sorry for the missing indentation in my previous e-mail! (This was my
> >> first e-mail to the dev-list so I could not simply use "reply".) Below is
> >> the same message, hopefully in more readable form.
> >>
> >> ====================================
> >>
> >> Hi All,
> >>
> >> Thank you Mohammad for your elaboration on the issues!
> >>
> >> I have written most of the multi-gem5 patch, so let me add some more
> >> clarifications and answers to your concerns. My comments are inline below.
> >>
> >> Thanks,
> >> - Gabor
> >>
> >> On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
> >>
> >> >Hi All,
> >> >
> >> >Curtis, thank you for listing some of the differences. I was waiting for
> >> >the completed multi-gem5 patch before sending my review. Please see my
> >> >inline response below. I've addressed the concerns that you've raised.
> >> >Also, I've added a bit more to the comparison.
> >> >
> >> >* Synchronization.
> >> >
> >> >pd-gem5 implements this in Python (not a problem in itself; aesthetically
> >> >this is nice, but...). The issue is that pd-gem5's data packets and
> >> >barrier messages travel over different sockets. Since pd-gem5 could see
> >> >data packets passing synchronization barriers, it could create an
> >> >inconsistent checkpoint.
> >> >
> >> >multi-gem5's synchronization is implemented in C++ using sync events, but
> >> >more importantly, the messages queue up in the same stream and so cannot
> >> >have the issue just described. (Event ordering is often crucial in
> >> >snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
> >> >solution in this respect.
> >> >
> >> >Each packet in pd-gem5 has a timestamp. So even if data packets pass
> >> >synchronization barriers (in other words, data packets arrive early at
> >> >the destination node), the destination node processes packets based on
> >> >their timestamps. Actually, allowing data packets to pass sync barriers
> >> >is a nice feature that can reduce the likelihood of late packet
> >> >reception. The ordering of data messages that flow over pd-gem5 nodes is
> >> >also preserved in the pd-gem5 implementation.
> >>
> >> This seems to be a misunderstanding. Maybe the wording was not precise
> >> before. The problem is not a data packet "passing" a sync barrier
> >> but the other way around: a sync barrier that can pass a data packet
> >> (e.g. while the data packet is waiting in the host operating system
> >> socket layer). If that happens, the packet will arrive later than it was
> >> supposed to and it may miss the computed receive tick.
> >>
> >> For instance, let's assume that the quantum coincides with the simulated
> >> Ether link delay. (This is the optimal choice of quantum to minimize the
> >> number of sync barriers.) If a data packet is sent right at the beginning
> >> of a quantum, then this packet must arrive at the destination gem5 process
> >> within the same quantum in order not to miss its receive tick at the very
> >> beginning of the next quantum. If the sync barrier can pass the data
> >> packet, then the data packet may arrive only during the next quantum (or
> >> in extreme conditions even later than that), so when it arrives the
> >> receiver gem5 may already have passed the receive tick.
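
The late-arrival condition described above can be illustrated with a small sketch. It is not taken from either patch: the function name and the tick values are hypothetical, and it assumes that the receive tick is simply the send tick plus the simulated link delay, and that a packet is "late" when the receiver's simulated time has already moved past that tick by the time the packet is read from the host socket.

```python
def is_late(send_tick, link_delay, receiver_tick_on_arrival):
    """Return True if the packet misses its computed receive tick.

    receiver_tick_on_arrival is the receiver's simulated time at the moment
    the packet is actually read from the host socket (hypothetical input).
    """
    recv_tick = send_tick + link_delay
    return receiver_tick_on_arrival > recv_tick


# Hypothetical numbers: the quantum equals the simulated link delay.
quantum = link_delay = 1000

# Packet sent at the start of quantum 0, so its receive tick is 1000.
# It is read while the receiver is still inside quantum 0: on time.
print(is_late(0, link_delay, 400))

# The sync barrier overtook the packet, the receiver entered the next
# quantum and is at tick 1300 when the packet shows up: late.
print(is_late(0, link_delay, 1300))
```

With the quantum equal to the link delay there is no slack: any packet that is read only after the barrier for its quantum has completed can end up in the second case.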
> >>
> >This argument makes more sense than the previous one. Note that gem5 is a
> >cycle accurate simulator and it runs orders of magnitude slower than real
> >hardware. So it's almost impossible for the flight time of a packet through
> >the real network to exceed the simulation time of one quantum. We ran a
> >set of experiments just for this purpose: with the quantum size equal to
> >the etherlink delay, we never got any late arrival violation (what you
> >described) for the full NAS benchmark suite (please refer to the paper).
> >
> >multi-gem5 is optimized for a case that almost never happens, and is
> >sacrificing speedup for no gain.
> >
> >
> >> Time-stamping does help with this issue. Also, if a data packet is waiting
> >> in the host operating system socket layer when the simulation thread exits
> >> to Python to complete the next sync barrier, then the packet will not go
> >> into the checkpoint that may follow that sync barrier.
> >>
> >That's a good point. The current pd-gem5 checkpointing mechanism might miss
> >packets that were sent during the previous quantum and are waiting in the
> >OS socket buffer. I should add some code inside the ethertap serialization
> >function to drain the ethertap socket before writing a checkpoint. I will
> >update the pd-gem5 patch accordingly.
> >
> >>
> >> >What you mentioned as an advantage for multi-gem5 is actually a key
> >> >disadvantage: buffering sync messages behind data packets can add to
> >> >the synchronization overhead and slow down simulation significantly.
> >>
> >> The purpose of sync messages is to make sure that the data packets arrive
> >> in time (in terms of simulated time) at the destination so they can be
> >> scheduled for being received at the proper computed tick. Sync messages
> >> also make sure that no data packets are in flight when a sync barrier
> >> completes before we take a checkpoint. They definitely add overhead for
> >> the simulation but they are necessary for the correctness of the
> >> simulation.
> >>
> >> The receive thread in multi-gem5 reads out packets from the socket in
> >> parallel with the simulation thread so packets normally will not be
> >> "queueing up" before a sync barrier message. There is definitely room
> >> for improvements in the current implementation for reducing the
> >> synchronization overhead but that is likely true for pd-gem5, too.
> >> The important thing here is that the solution must provide correctness
> >> (robustness) first.
> >>
> >pd-gem5 provides correctness. Please read my previous comment. The whole
> >purpose of multi/pd-gem5 is to parallelize simulation with minimal overhead
> >and gain speedup. If you fail to do so, nobody will use your tool.
> >
> >
> >> >Also, multi-gem5 sends huge messages (multiHeaderPkt) through the network
> >> >to perform each synchronization point, which increases synchronization
> >> >overhead further. In pd-gem5, we choose to send just one character as the
> >> >sync message through a separate socket to reduce synchronization overhead.
> >>
> >> The TCP/IP message size is unlikely to be the bottleneck here. Multi-gem5
> >> will send ~50 bytes more in a sync barrier message than pd-gem5, but that
> >> bigger sync message still fits into a single Ethernet frame on the wire.
> >> The end-to-end latency overhead that is caused by 50 bytes of extra
> >> payload for a small single-frame TCP/IP message is likely to fall into the
> >> "noise" category if one tries to measure it in a real cluster.
> >>
> >You should prove your hypothesis experimentally. Each gem5 process
> >sends/receives sync messages at the end of every quantum. Say you are
> >simulating an "N"-node computer cluster with "M" different configurations.
> >Then you will have N*M gem5 processes that send/receive these 50 bytes (I
> >think it's more) of extra data at the same time over the network ...
> >
> >Furthermore, multi-gem5 sends a header before each data message. By
> >comparison, pd-gem5 just adds 12 Bytes (each timestamp is the 12 least
> >significant digits of the Tick) to each data packet. I don't know exactly
> >how large these "MultiHeaderPkt" are, but it has two Tick fields that
> >are each 64 Bytes! Also, header packets are separate TCP packets, so you
> >pay for sending two separate packets for each data packet. And worst of
> >all, you serialize all of these with sync messages.
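
As a rough illustration of the 12-byte timestamp mentioned above, the sketch below prepends the 12 least significant decimal digits of a tick to a payload. This is only an assumption about the encoding (zero-padded ASCII decimal); the function names are hypothetical and are not the pd-gem5 implementation.

```python
TS_DIGITS = 12              # assumed: 12 decimal digits, one byte per digit
TS_MOD = 10 ** TS_DIGITS


def add_timestamp(tick, payload):
    """Prefix payload with the 12 least significant decimal digits of tick."""
    stamp = str(tick % TS_MOD).zfill(TS_DIGITS).encode("ascii")
    return stamp + payload


def split_timestamp(frame):
    """Return (truncated tick, payload) from a frame built by add_timestamp."""
    stamp, payload = frame[:TS_DIGITS], frame[TS_DIGITS:]
    return int(stamp.decode("ascii")), payload


frame = add_timestamp(1234567890123456, b"data")
print(len(frame) - len(b"data"))   # per-packet overhead in bytes
print(split_timestamp(frame))
```

Because only the low 12 digits travel with the packet, a receiver following this scheme would have to reconstruct the full tick from its own current simulated time; how pd-gem5 actually does that is not stated in this thread.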
> >
> >
> >> >
> >> >* Packet handling.
> >> >
> >> >pd-gem5 uses EtherTap for data packets but changed the polling mechanism
> >> >to go through the main event queue. Since this rate is actually linked
> >> >with simulator progress, it cannot guarantee that the packets are
> >> >serviced at regular intervals of real time. This can lead to packets
> >> >queueing up which would contribute to the synchronization issues
> >> >mentioned above.
> >> >
> >> >multi-gem5 uses plain sockets with separate receive threads and so does
> >> >not have this issue.
> >> >
> >> >I think you are again pointing to your first concern, which I've
> >> >explained above. Packets that have queued up in the EtherTap socket will
> >> >be processed and delivered to the simulation environment at the beginning
> >> >of the next simulation quantum.
> >> >
> >> >Please notice that multi-gem5 introduces a new simObject to interface the
> >> >simulation environment to the real world, which is redundant. This
> >> >functionality is already provided by EtherTap.
> >>
> >> Except that the EtherTap solution does not provide a correct (robust)
> >> solution for the synchronization problem.
> >>
> >Please read my first/second comments.
> >
> >
> >> >
> >> >* Checkpoint accuracy.
> >> >
> >> >A user would like to have a checkpoint at precisely the time the
> >> >'m5 checkpoint' operation is executed so as to not miss any of the
> >> >area of interest in his application.
> >> >
> >> >pd-gem5 requires that simulation finish the current quantum
> >> >before checkpointing, so it cannot provide this.
> >> >
> >> >(Shortening the quantum can help, but usually the snapshot is being taken
> >> >while 'fast-forwarding', i.e. simulating as fast as possible, which would
> >> >motivate a longer quantum.)
> >> >
> >> >multi-gem5 can enter the drain cycle immediately upon receiving a
> >> >checkpoint request. We find this accuracy highly desirable.
> >> >
> >> >It's true that if you have a large quantum size then there would be some
> >> >discrepancy between the m5_ckpt instruction tick and the actual dump tick.
> >> >Based on the multi-gem5 code, my understanding is that you send an async
> >> >checkpoint message as soon as one of the gem5 processes encounters an
> >> >m5_ckpt instruction. But I'm not sure how you fix the aforementioned
> >> >issue, because you have to sync all gem5 processes before you start
> >> >dumping the checkpoint, which necessitates a global synchronization
> >> >beforehand.
> >>
> >> In multi-gem5, the gem5 process that encounters the m5_ckpt instruction
> >> sends out an async checkpoint notification to the peer gem5 processes and
> >> then it starts draining immediately (at the same tick). So the
> >> checkpoint will be taken at the exact tick from the initiator process's
> >> point of view. The global synchronisation with the peer processes takes
> >> place while the initiator process is still waiting at the same tick (i.e.
> >> the simulation thread is suspended). However, the receiver thread
> >> continues reading out the socket while waiting for the global sync to
> >> complete, to make sure that in-flight data packets from peer gem5
> >> processes are stored properly and saved into the checkpoint.
> >>
> >>
> >So you mean multi-gem5 ends up with gem5 processes at different
> >ticks after a checkpoint? In pd-gem5 we make sure that all gem5 processes
> >start dumping the checkpoint at the same tick. Are you sure that it is
> >correct to have each gem5 process dump its checkpoint at a different tick?
> >
> >I don't think this is a correct checkpointing design. However, if you feel
> >it is correct, I can change a couple of lines in "Simulation.py" and the
> >barrier scripts to implement the same functionality in pd-gem5. One thing
> >that you are obsessed about is making sure that there are no in-flight
> >packets when we start dumping a checkpoint, and you have all these complex
> >mechanisms in place to ensure that! I think you can be 99.99999% sure that
> >there is no in-flight packet by waiting for 1 second after all gem5
> >processes have finished their quantum simulation and then dumping the
> >checkpoint. Do you really think that delivering a TCP packet would take
> >more than 1 second in today's systems!? Always go for simple solutions ...
> >
> >
> >
> >> >
> >> >By the way, we have a fix for this issue by introducing a new m5 pseudo
> >> >instruction.
> >>
> >> I fail to see how a new pseudo instruction can solve the problem of
> >> completing the full quantum in pd-gem5 before a checkpoint can be taken.
> >> Could you please elaborate on that?
> >>
> >As we take checkpoints while fast-forwarding, and it is likely that we
> >relax synchronization for speedup purposes, a new pseudo instruction that
> >can set the quantum size (m5_qset) can be helpful. One can insert m5_qset
> >in the benchmark source code before entering the ROI that contains m5_ckpt
> >to decrease the quantum size beforehand and reduce the discrepancy between
> >the m5_ckpt tick and the actual checkpoint tick. This is not included in
> >the pd-gem5 patch right now.
> >
> >
> >> >
> >> >* Implementation of network topology.
> >> >
> >> >pd-gem5 uses a separate gem5 process to act as a switch whereas
> >> >multi-gem5 uses a standalone packet relay process.
> >> >
> >> >We haven't measured the overhead of pd-gem5's simulated switch yet, but
> >> >we're confident that our approach is at least as fast and more scalable.
> >> >
> >> >pd-gem5 has the flexibility to simulate a switch box alongside one
> >> >of the other gem5 processes. However, that might make that gem5 process
> >> >the simulation bottleneck. One of the advantages of pd-gem5 over
> >> >multi-gem5 is that we use gem5 to simulate a switch box, which allows us
> >> >to model any network topology by instantiating several Switch simObjects
> >> >and interconnecting them with EtherLink in an arbitrary fashion. A
> >> >standalone TCP server can only provide switch functionality (forwarding
> >> >packets to destinations) and model a star network topology. Furthermore,
> >> >it cannot model various network timings such as queueing delay,
> >> >congestion, and routing latency. It also has some accuracy issues that I
> >> >will point out next.
> >>
> >> I agree with the complex topology argument. We already mentioned that
> >> before as an advantage for pd-gem5 from the point of view of future
> >> extensions. However, I do not agree that multi-gem5 cannot model queueing
> >> delays and congestion. For a simple crossbar switch, it can model queueing
> >> delays and congestion, but the receive queues are distributed among the
> >> gem5 processes.
> >>
> >It's true that you can model the queueing delay of a simple crossbar by
> >distributing queues across gem5 processes (end points). But to be able to
> >do so you have to ensure the ordering of packets that you enqueue in the
> >distributed queues. That is almost impossible without a synchronized switch
> >box. You would need a reorder queue that reorders packets dynamically and
> >updates the timing parameters for each packet as well. I don't know how
> >much progress you have made on an ordering scheme in multi-gem5, but you
> >may have already realized how complex and error-prone it can be. This
> >argument is also related to my next argument, "Broken network timing".
> >
> >
> >> >
> >> >* Broken network timing:
> >> >
> >> >Forwarding packets between gem5 processes using a standalone TCP server
> >> >can cause reordering between packets that have different sources but the
> >> >same destination. This causes inaccurate network timing and, worst of
> >> >all, non-deterministic simulation. pd-gem5 resolves this by reordering
> >> >packets at the Switch process and then sending them to their destination
> >> >(it's possible because the switch is synchronized with the rest of the
> >> >nodes).
> >>
> >> In multi-gem5, there is always a HeaderPkt that contains some meta
> >> information for each data packet. The meta information includes the send
> >> tick and the sender rank (i.e. a unique ID of the sender gem5 process).
> >> We use this information to define a well-defined ordering of packets even
> >> if packets are arriving at the same receiver from different senders. This
> >> packet ordering scheme is still being tested, so the corresponding patch
> >> is not on the RB yet.
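
One way such an ordering could work is sketched below. The exact rule used by multi-gem5 is not given in this thread; this sketch assumes packets are ordered by send tick with ties broken by sender rank, and the `Header` type and `delivery_order` function are hypothetical names.

```python
from collections import namedtuple

# Hypothetical stand-in for the per-packet meta information.
Header = namedtuple("Header", ["send_tick", "sender_rank", "payload"])


def delivery_order(packets):
    """Order packets by send tick, breaking ties by sender rank.

    The result does not depend on the order in which the host network
    happened to deliver the packets to the receiver.
    """
    return sorted(packets, key=lambda p: (p.send_tick, p.sender_rank))


# Same three packets, two different host-level arrival orders.
run_a = [Header(200, 1, "b"), Header(100, 2, "c"), Header(100, 0, "a")]
run_b = [Header(100, 2, "c"), Header(100, 0, "a"), Header(200, 1, "b")]

print([p.payload for p in delivery_order(run_a)])
print([p.payload for p in delivery_order(run_b)])
```

Both runs print the same sequence, which is the property needed for deterministic simulation when packets from different senders reach the same receiver.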
> >>
> >Please read my previous comment. The most important part of the
> >multi/pd-gem5 extension is ensuring accurate and deterministic simulation.
> >
> >
> >> >
> >> >* Amount of changes
> >> >
> >> >pd-gem5 introduces different modes in etherlink just to provide accurate
> >> >timing for each component in the network subsystem (NIC, link, switch) as
> >> >well as the capability of modeling different network topologies (mesh,
> >> >ring, fat tree, etc). To enable simple functionality, like what multi-gem5
> >> >provides, the amount of changes in gem5 can be limited to time-stamping
> >> >packets and providing synchronization through Python scripts. However,
> >> >multi-gem5 re-implements functionality that is already in gem5.
> >>
> >> This argument holds only if both implementations are correct (robust). It
> >> still seems to me that pd-gem5 does not provide correctness for the
> >> synchronization/checkpointing parts.
> >>
> >Again, please read my first comment for correctness of pd-gem5.
> >
> >
> >> >
> >> >* Integrating with gem5 mainstream:
> >> >
> >> >The pd-gem5 launch script is written in Python, which is suited for
> >> >integration with gem5 Python scripts. However, multi-gem5 uses a bash
> >> >script. Also, all source files in pd-gem5 are already part of gem5
> >> >mainstream. However, multi-gem5 has tcp_server.cc/hh, which is a
> >> >standalone process and cannot be part of gem5.
> >>
> >> The multi-gem5 launch script is simple enough to rely only on the shell.
> >> It can obviously be easily rewritten in Python if that added any value.
> >> The tcp_server component is only a utility (like the "m5" utility that is
> >> also part of gem5).
> >>
> >The thing is that it's more likely that users will want to add some
> >functionality to the run script of multi/pd-gem5. E.g. the pd-gem5 run
> >script supports launching simulations using simulation pool management
> >software (http://research.cs.wisc.edu/htcondor/). Using Python enables
> >users to easily add this kind of support.
> >
> >
> >>
> >> Cheers,
> >> - Gabor
> >>
> >>
> >> >On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <***@arm.com>
> >> >wrote:
> >> >
> >> >>Hello everyone,
> >> >>We have taken a look at how pd-gem5 compares with multi-gem5. While
> >> >>intending to deliver the same functionality, there are some crucial
> >> >>differences:
> >> >>
> >> >>* Synchronization.
> >> >>
> >> >> pd-gem5 implements this in Python (not a problem in itself;
> >> >> aesthetically this is nice, but...). The issue is that pd-gem5's data
> >> >> packets and barrier messages travel over different sockets. Since
> >> >> pd-gem5 could see data packets passing synchronization barriers, it
> >> >> could create an inconsistent checkpoint.
> >> >>
> >> >> multi-gem5's synchronization is implemented in C++ using sync events,
> >> >> but more importantly, the messages queue up in the same stream and so
> >> >> cannot have the issue just described. (Event ordering is often crucial
> >> >> in snapshot protocols.) Therefore we feel that multi-gem5 is a more
> >> >> robust solution in this respect.
> >> >>
> >> >>* Packet handling.
> >> >>
> >> >> pd-gem5 uses EtherTap for data packets but changed the polling
> >> >> mechanism to go through the main event queue. Since this rate is
> >> >> actually linked with simulator progress, it cannot guarantee that the
> >> >> packets are serviced at regular intervals of real time. This can lead
> >> >> to packets queueing up which would contribute to the synchronization
> >> >> issues mentioned above.
> >> >>
> >> >> multi-gem5 uses plain sockets with separate receive threads and so
> >> >> does not have this issue.
> >> >>
> >> >>* Checkpoint accuracy.
> >> >>
> >> >> A user would like to have a checkpoint at precisely the time the
> >> >> 'm5 checkpoint' operation is executed so as to not miss any of the
> >> >> area of interest in his application.
> >> >>
> >> >> pd-gem5 requires that simulation finish the current quantum
> >> >> before checkpointing, so it cannot provide this.
> >> >>
> >> >> (Shortening the quantum can help, but usually the snapshot is being
> >> >> taken while 'fast-forwarding', i.e. simulating as fast as possible,
> >> >> which would motivate a longer quantum.)
> >> >>
> >> >> multi-gem5 can enter the drain cycle immediately upon receiving a
> >> >> checkpoint request. We find this accuracy highly desirable.
> >> >>
> >> >>* Implementation of network topology.
> >> >>
> >> >> pd-gem5 uses a separate gem5 process to act as a switch whereas
> >> >> multi-gem5 uses a standalone packet relay process.
> >> >>
> >> >> We haven't measured the overhead of pd-gem5's simulated switch yet,
> >> >> but we're confident that our approach is at least as fast and more
> >> >> scalable.
> >> >>
> >> >>
> >> >>Thanks,
> >> >>Curtis
> >> >>________________________________________
> >> >>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad Alian
> >> >>[***@wisc.edu]
> >> >>Sent: Friday, June 26, 2015 7:37 PM
> >> >>To: gem5 Developer List
> >> >>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> >> >>system on multiple physical hosts
> >> >>
> >> >>Hi Anthony,
> >> >>
> >> >>I think that would be a good option, then I can add pd-gem5 functionality
> >> >>on top of that. Right now I've simplified your implementation. Also, I
> >> >>think I had found some bugs in your patch that I cannot remember now. If
> >> >>you decide to ship the EtherSwitch patch, let me know so I can give you a
> >> >>review on that.
> >> >>
> >> >>Thanks,
> >> >>Mohammad
> >> >>
> >> >>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
> >> >>***@amd.com> wrote:
> >> >>
> >> >>>Would it make sense for me to ship the EtherSwitch patch first, since it
> >> >>>has utility on its own, and then we can decide which of the "multi-gem5"
> >> >>>approaches is best, or if it's some combination of both?
> >> >>>
> >> >>>The only reason I never shipped it was because Steve raised an issue that
> >> >>>I didn't have a good alternative for, and didn't have the time to look
> >> >>>into one at that time.
> >> >>>________________________________________
> >> >>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
> >> >>Alian [
> >> >>>***@wisc.edu]
> >> >>>Sent: Wednesday, June 24, 2015 12:43 PM
> >> >>>To: gem5 Developer List
> >> >>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> >> >>system
> >> >>>on multiple physical hosts
> >> >>>
> >> >>>Hi Andreas,
> >> >>>
> >> >>>Thanks for the comment.
> >> >>>I think the checkpointing support in both works is the same. Here is how
> >> >>>checkpointing support is implemented in pd-gem5:
> >> >>>
> >> >>>Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
> >> >>>instruction, it sends a "recv-ckpt" signal to the "barrier" process.
> >> >>>The "barrier" process then sends a "take-ckpt" signal to all the
> >> >>>simulated nodes (including the node that encountered m5-checkpoint) at
> >> >>>the end of the current simulation quantum. On reception of the
> >> >>>"take-ckpt" signal, the gem5 processes start dumping checkpoints. This
> >> >>>makes each simulated node dump a checkpoint at the same simulated time
> >> >>>point while ensuring there are no in-flight packets.
> >> >>>
> >> >>>I believe this is the same as the multi-gem5 patch's approach to
> >> >>>checkpoint support (based on the commit message of
> >> >>>http://reviews.gem5.org/r/2865/).
> >> >>>Also, we have tested our mechanism with several benchmarks and it works.
> >> >>>As Steve suggested, I'll look into Curtis's patch and try to review it
> >> >>>as well. But as Nilay also mentioned earlier, there is some code
> >> >>>missing in Curtis's patch. I prefer to first run multi-gem5 before
> >> >>>starting to review it.
> >> >>>
> >> >>>Thank you,
> >> >>>Mohammad
> >> >>>
> >> >>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
> >> >>***@arm.com>
> >> >>>wrote:
> >> >>>
> >> >>>>Hi Steve,
> >> >>>>
> >> >>>>Apologies for the confusion. We are on the same page. My point is that
> >> >>>>we cannot simply take a little bit of patch A and a little bit of
> >> >>>>patch B. This change involves a lot of code, and we need to approach
> >> >>>>this in a structured fashion. My proposal is to do it bottom up, and
> >> >>>>start by getting the basic support in place. Since
> >> >>>>http://reviews.gem5.org/r/2826/ has already been on the review board
> >> >>>>for a few months, I am merely suggesting that it would be a good start
> >> >>>>to relate the newly posted patches to what is already there.
> >> >>>>
> >> >>>>Andreas
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> >> >>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
> >> >>>>
> >> >>>>>Hi Andreas,
> >> >>>>>
> >> >>>>>I'm a little confused by your email---you say you're fundamentally
> >> >>>>>opposed to looking at both patches and picking the best features, then
> >> >>>>>you point out that the patches Curtis posted have the feature of better
> >> >>>>>checkpointing support so we should pick that :).
> >> >>>>>
> >> >>>>>Obviously we can't just pick patch A from Mohammad's set and patch B
> >> >>>>>from Curtis's set and expect them to work together, but I think that
> >> >>>>>having both sets of patches available and comparing and contrasting the
> >> >>>>>two implementations should enable us to get to a single implementation
> >> >>>>>that's the best of both. Someone will have to make the effort of
> >> >>>>>integrating the better ideas from one set into the other set to create a
> >> >>>>>new unified set of patches (or maybe we commit one set and then integrate
> >> >>>>>the best of the other set as patches on top of that), but the first step
> >> >>>>>is to identify what "the best of both" is. Having Mohammad look at
> >> >>>>>Curtis's patches, and Curtis (or someone else from ARM) closely examine
> >> >>>>>Mohammad's patches would be a great start. I intend to review them both,
> >> >>>>>though unfortunately my time has been scarce lately---I'm hoping to
> >> >>>>>squeeze that in later this week.
> >> >>>>>
> >> >>>>>Once we've had a few people look at both, we can discuss the pros and
> >> >>>>>cons of each, then discuss the strategy for getting the best features
> >> >>>>>in. So far I've heard that Mohammad's patches have a better network
> >> >>>>>model but the ARM patches have better checkpointing support; that seems
> >> >>>>>like a good start.
> >> >>>>>
> >> >>>>>Steve
> >> >>>>>
> >> >>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> >> >>>***@arm.com
> >> >>>>>
> >> >>>>>wrote:
> >> >>>>>
> >> >>>>>>Hi all,
> >> >>>>>>
> >> >>>>>>Great work. However, I fundamentally do not believe in the approach
> >> >>>>>>of 'letting reviewers pick the best features'. There is no way we
> >> >>>>>>would ever get something working out of it. We need to get _one_
> >> >>>>>>working solution here, and figure out how to best get there. I would
> >> >>>>>>propose to do it bottom up, starting with the basic multi-simulator
> >> >>>>>>instance support, checkpointing support, and then move on to the
> >> >>>>>>network between the simulator instances.
> >> >>>>>>
> >> >>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
> >> >>>>>>support from what Curtis has posted. I believe proper checkpointing
> >> >>>>>>support to be the most challenging, and from what I can tell this is
> >> >>>>>>far more limited in what you just posted, Mohammad. Could you perhaps
> >> >>>>>>review Curtis's patches based on your insights, and we can try and get
> >> >>>>>>these patches in shape and committed asap?
> >> >>>>>>
> >> >>>>>>Once we have the baseline functionality in place, then we can
> >> >>start
> >> >>>>>>looking at the more elaborate network models.
> >> >>>>>>
> >> >>>>>>Does this sound reasonable?
> >> >>>>>>
> >> >>>>>>Thanks,
> >> >>>>>>
> >> >>>>>>Andreas
> >> >>>>>>
> >> >>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> >> >>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >> >>>>>>
> >> >>>>>>>Hello All,
> >> >>>>>>>
> >> >>>>>>>I have submitted a chain of patches which enables gem5 to
> >> >>simulate
> >> >>a
> >> >>>>>>>cluster on multiple physical hosts:
> >> >>>>>>>
> >> >>>>>>>http://reviews.gem5.org/r/2909/
> >> >>>>>>>http://reviews.gem5.org/r/2910/
> >> >>>>>>>http://reviews.gem5.org/r/2912/
> >> >>>>>>>http://reviews.gem5.org/r/2913/
> >> >>>>>>>http://reviews.gem5.org/r/2914/
> >> >><http://reviews.gem5.org/r/2914/>
> >> >>>>>>>
> >> >>>>>>>and a patch that contains run scripts for a simple experiment:
> >> >>>>>>>http://reviews.gem5.org/r/2915/
> >> >>>>>>>
> >> >>>>>>>We have run several benchmarks using this infrastructure,
> >> >>including
> >> >>>NAS
> >> >>>>>>>parallel benchmarks (MPI) and DCBench-hadoop
> >> >>>>>>>(http://prof.ict.ac.cn/DCBench/),
> >> >>>>>>>and would be happy to share scripts/diskimages.
> >> >>>>>>>
> >> >>>>>>>We call this *pd-gem5*. *pd-gem5 *functionality is more or less
> >> >>the
> >> >>>>>>same
> >> >>>>>>>as
> >> >>>>>>>Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
> >> >>*network
> >> >>>>>>model
> >> >>>>>>>is
> >> >>>>>>>more thorough; it also enables modeling different network
> >> >>topologies.
> >> >>>>>>>Having both set of changes together let reviewers to pick best
> >> >>>features
> >> >>>>>>>from both works.
> >> >>>>>>>
> >> >>>>>>>Thank you,
> >> >>>>>>>Mohammad Alian
> >> >>>>>>>_______________________________________________
> >> >>>>>>>gem5-dev mailing list
> >> >>>>>>>gem5-***@gem5.org
> >> >>>>>>>http://m5sim.org/mailman/listinfo/gem5-dev
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>-- IMPORTANT NOTICE: The contents of this email and any
> >> >>attachments
> >> >>>are
> >> >>>>>>confidential and may also be privileged. If you are not the
> >> >>intended
> >> >>>>>>recipient, please notify the sender immediately and do not
> >> >>disclose
> >> >>>the
> >> >>>>>>contents to any other person, use it for any purpose, or store or
> >> >>copy
> >> >>>>>>the
> >> >>>>>>information in any medium. Thank you.
> >> >>>>>>
> >> >>>>>>ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1
> >> >>9NJ,
> >> >>>>>>Registered in England & Wales, Company No: 2557590
> >> >>>>>>ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge
> >> >>CB1
> >> >>>>>>9NJ,
> >> >>>>>>Registered in England & Wales, Company No: 2548782
Mohammad Alian
2015-07-02 22:38:09 UTC
Permalink
Hi all,

Thank you Steve for your insightful comments.

I agree with Andreas's first two points, too. Just one comment on his first
point: even if a packet-arrival violation happens, which is unlikely, you
can abort the simulation (or continue, but without expecting determinism).
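
To make the abort-or-continue choice concrete, here is a minimal Python
sketch. It is not pd-gem5 code: `schedule_receive`, `LatePacketError`, and the
`strict` flag are hypothetical names, and it assumes that a packet's receive
tick is its send time-stamp plus the simulated link delay.

```python
class LatePacketError(RuntimeError):
    """Raised when a packet arrives after its computed receive tick."""


def schedule_receive(cur_tick, send_tick, link_delay, strict=True):
    """Return (tick_to_deliver_at, on_time) for an incoming packet.

    Hypothetical helper: assumes receive tick = send time-stamp + link delay.
    """
    recv_tick = send_tick + link_delay
    if recv_tick >= cur_tick:
        # Packet arrived early or on time: deliver at its computed tick.
        return recv_tick, True
    if strict:
        # Late arrival: abort rather than silently lose determinism.
        raise LatePacketError(
            f"packet due at tick {recv_tick} arrived at tick {cur_tick}")
    # Non-strict mode: deliver immediately; the run is no longer deterministic.
    return cur_tick, False
```

With `strict=True` a late packet stops the run; with `strict=False` it is
delivered at the current tick and the caller can record that determinism was
lost.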

Regarding 2, pd-gem5 relies on the EtherTap socket interface to connect gem5
processes together. To support a new API, we would have to change the
EtherTap interface. Referring to your third point, if we could integrate
multi-threaded gem5 with multi/pd-gem5, then the current socket interface
would suffice.

Regarding 3, this is a very good point. I actually meant to set up what you
described with pd-gem5, and I believe this is the right use case for
multi/pd-gem5. The network model of pd-gem5 is the best fit for integrating
multi-threaded gem5. Each multi-threaded gem5 process needs a top-of-rack
switch model with one uplink port that connects it to the next switch in the
hierarchy, which can be another top-of-rack switch simulated in another
multi-threaded gem5 process or a separate switch-box model. I think this is
possible with pd-gem5, but I haven't tested it yet. I'll take a closer look
and let you know.
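
As a rough illustration of that hierarchy, the Python sketch below enumerates
the links of a two-level topology. It is only a sketch: `build_topology`, the
node names, and the single next-level switch `agg0` are hypothetical, and it
models connectivity only, not gem5 objects.

```python
def build_topology(num_racks, nodes_per_rack):
    """Return a list of (endpoint_a, endpoint_b) links.

    Each rack stands for one multi-threaded gem5 process; its top-of-rack
    (ToR) switch has exactly one uplink to the next switch in the hierarchy.
    """
    links = []
    for rack in range(num_racks):
        tor = f"tor{rack}"
        # Links that stay inside one multi-threaded gem5 process.
        for node in range(nodes_per_rack):
            links.append((f"rack{rack}.node{node}", tor))
        # The single uplink is the only link that crosses process boundaries.
        links.append((tor, "agg0"))
    return links
```

Only the uplinks would need the socket interface between processes; the
in-rack links would be handled by the existing multithreaded support.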

Regarding 4, it is possible to replicate a pd-gem5 distributed simulation
under single-threaded gem5.

I think the coincidence of having both patches on the review board at the
same time is a good opportunity for the gem5 community to take advantage of.
We can work toward the goals that Steve mentioned through collaboration and
unbiased points of view.

Thank you,
Mohammad


On Thu, Jul 2, 2015 at 1:20 PM, Steve Reinhardt <***@gmail.com> wrote:

> Hi everyone,
>
> Sorry for taking so long to engage. This is a great development and I think
> both these patches are terrific contributions. Thanks to Mohammad, Gabor,
> and everyone else involved.
>
> I agree with Andreas that we should start with some top-level goals &
> assumptions, agree on those, and then we can sort out the detailed issues
> based on a consistent view.
>
> I definitely agree with Andreas's first two points. The third one seems a
> little surprising; I'd like to hear more about the motivation before
> expressing an opinion. I can see where non-synchronous checkpointing could
> be useful, but it's also clear from the associated patch that it's not
> trivial to implement either. How much would be lost by requiring a
> synchronization before a checkpoint?
>
> From my personal perspective, I would like to see whatever we do here be a
> first step toward a more general distributed simulation platform. Both of
> these patches seem pretty Ethernet-centric in different ways. This is not
> terrible; part of the problem is that gem5's current internal networking
> support is already overly Ethernet-centric IMO. But it would be nice to
> avoid baking that in even further. Rather than assume I have understood all
> the code completely, I'll phrase things in the form of questions, and
> people can comment on how those questions would be answered in the context
> of the two different approaches.
>
> 1. How much effort would be required to simulate a non-Ethernet network? My
> impression is that pd-gem5 has a leg up here, since a gem5 switch model for
> a non-Ethernet network (which you'd have to write anyway if you were
> simulating a different network) could be used in place of the current
> Ethernet switch, where for multi-gem5 I think that the
> util/multi//tcp_server.cc code would have to be modified (i.e., there'd be
> additional work above and beyond what you'd need to get the network modeled
> in base gem5).
>
> 2. How much effort is required to run on a non-Ethernet network (or
> equivalently using a non-sockets API)? The MultiIface/TCPIface split in
> the multi-gem5 code looks like it addresses this nicely, but pd-gem5 seems
> pretty tied to an Ethernet host fabric.
>
> 3. Do both of these patches work with the existing multithreaded
> multiple-event-queue simulation? I think multi-gem5 does (though it would
> be nice to have a confirmation), but it's not clear about pd-gem5. I don't
> see a benefit to having multiple gem5 processes on a single host vs. a
> single multithreaded gem5 process using the existing support. I think this
> could be particularly valuable with a hierarchical network; e.g., maybe I
> would want to model a rack in multithreaded mode on a single multicore
> server, then use pd-gem5 or multi-gem5 to build up a simulation of multiple
> racks. Would this work out of the box with either of these patches, and if
> not, what would need to be done?
>
> 4. Is it possible to construct a single-process simulation model that's
> identical to the distributed simulation? It would be very valuable for
> verification to be able to take a single simulation run and do it both
> within a single process and also across multiple processes and verify that
> identical results are achieved. This seems like a big drawback to the
> multi-gem5 tcp_server approach, IMO.
>
> I'm definitely not saying that all these issues need to be resolved before
> anything gets committed, but if we can agree that these are valid goals,
> then we can evaluate detailed issues based on whether they move us toward
> or away from those goals.
>
> Thanks,
>
> Steve
>
>
> On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson <***@arm.com>
> wrote:
>
> > Hi all,
> >
> > I think we need to up-level this a bit. From our perspective (and I
> > suspect in general):
> >
> > 1. Robustness is important. Having a design that _may_ break, however
> > unlikely, is simply not an option.
> >
> > 2. Performance and scaling is important. We can compare actual numbers
> > here, and I am fairly sure the two solutions are on par. Let’s quantify
> > that though.
> >
> > 3. Checkpointing must not rely on synchronicity. It is vital for several
> > workloads that we can checkpoint the various gem5 instances at different
> > Ticks (due to the way the workloads are constructed).
> >
> > Andreas
> >
> > On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
> > <gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >
> > >Thanks Gabor for the reply.
> > >
> > >I feel this conversation is useful as we can find out pros/cons of each
> > >design.
> > >Please find my response in-lined below.
> > >
> > >Thank you,
> > >Mohammad
> > >
> > >On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com>
> wrote:
> > >
> > >> Hi All,
> > >>
> > >> Sorry for the missing indentation in my previous e-mail! (This was my
> > >> first e-mail to the dev-list so I could not simply use “reply"). Below
> > >>is
> > >> the same message, hopefully in more readable form.
> > >>
> > >> ====================================
> > >>
> > >> Hi All,
> > >>
> > >> Thank you Mohammad for your elaboration on the issues!
> > >>
> > >> I have written most of the multi-gem5 patch so let me add some more
> > >> clarifications and answer to your concerns. My comments are inline
> > >>below.
> > >>
> > >> Thanks,
> > >> - Gabor
> > >>
> > >> On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
> > >>
> > >> >Hi All,
> > >> >
> > >> >Curtis-Thank you for listing some of the differences. I was waiting for
> > >> >the completed multi-gem5 patch before I send my review. Please see my
> > >> >inline response below. I've addressed the concerns that you've raised.
> > >> >Also, I've added a bit more to the comparison.
> > >> >
> > >> >* Synchronization.
> > >> >
> > >> >pd-gem5 implements this in Python (not a problem in itself; aesthetically
> > >> >this is nice, but...). The issue is that pd-gem5's data packets and
> > >> >barrier messages travel over different sockets. Since pd-gem5 could see
> > >> >data packets passing synchronization barriers, it could create an
> > >> >inconsistent checkpoint.
> > >> >
> > >> >multi-gem5's synchronization is implemented in C++ using sync events, but
> > >> >more importantly, the messages queue up in the same stream and so cannot
> > >> >have the issue just described. (Event ordering is often crucial in
> > >> >snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
> > >> >solution in this respect.
> > >> >
> > >> >Each packet in pd-gem5 has a time-stamp. So even if data packets pass
> > >> >synchronization barriers (in other words, data packets arrive early at
> > >> >the destination node), the destination node processes packets based on
> > >> >their timestamps. Actually, allowing data packets to pass sync barriers
> > >> >is a nice feature that can reduce the likelihood of late packet
> > >> >reception. The ordering of data messages that flow over pd-gem5 nodes is
> > >> >also preserved in the pd-gem5 implementation.
> > >>
> > >> This seems to be a misunderstanding. Maybe the wording was not precise
> > >> before. The problem is not a data packet "passing" a sync barrier
> > >> but the other way around: a sync barrier that can pass a data packet
> > >> (e.g. while the data packet is waiting in the host operating system
> > >> socket layer). If that happens, the packet will arrive later than it was
> > >> supposed to and it may miss the computed receive tick.
> > >>
> > >> For instance, let's assume that the quantum coincides with the simulated
> > >> Ether link delay. (This is the optimal choice of quantum to minimize the
> > >> number of sync barriers.) If a data packet is sent right at the beginning
> > >> of a quantum then this packet must arrive at the destination gem5 process
> > >> within the same quantum in order not to miss its receive tick at the very
> > >> beginning of the next quantum. If the sync barrier can pass the data
> > >> packet then the data packet may arrive only during the next quantum (or in
> > >> extreme conditions even later than that), so when it arrives the receiver
> > >> gem5 may already have passed the receive tick.
> > >>
> > >This argument makes more sense than the previous one. Note that gem5 is a
> > >cycle-accurate simulator and it runs orders of magnitude slower than real
> > >hardware. So it's almost impossible for the flight time of a packet
> > >through the real network to exceed the simulation time of one quantum. We
> > >ran a set of experiments just for this purpose: with the quantum size equal
> > >to the etherlink delay, we never got any late arrival violation (what you
> > >described) for the full NAS benchmark suite (please refer to the paper).
> > >
> > >multi-gem5 is optimized for a case that almost never happens, and is
> > >sacrificing speedup for no gain.
> > >
> > >
> > >> Time-stamping does help with this issue. Also, if a data packet is waiting
> > >> in the host operating system socket layer when the simulation thread exits
> > >> to python to complete the next sync barrier then the packet will not go
> > >> into the checkpoint that may follow that sync barrier.
> > >>
> > >That's a good point. The current pd-gem5 checkpointing mechanism might miss
> > >packets that were sent during the previous quantum and are waiting in the
> > >OS socket buffer. I should add some code inside the ethertap serialization
> > >function to drain the ethertap socket before writing a checkpoint. I will
> > >update the pd-gem5 patch accordingly.
> > >
> > >>
> > >> >What you mentioned as an advantage for multi-gem5 is actually a key
> > >> >disadvantage: buffering sync messages behind data packets can add to the
> > >> >synchronization overhead and slow down simulation significantly.
> > >>
> > >> The purpose of sync messages is to make sure that the data packets arrive
> > >> in time (in terms of simulated time) at the destination so they can be
> > >> scheduled for being received at the proper computed tick. Sync messages
> > >> also make sure that no data packets are in flight when a sync barrier
> > >> completes before we take a checkpoint. They definitely add overhead for
> > >> the simulation but they are necessary for the correctness of the
> > >> simulation.
> > >>
> > >> The receive thread in multi-gem5 reads out packets from the socket in
> > >> parallel with the simulation thread so packets normally will not be
> > >> "queueing up" before a sync barrier message. There is definitely room
> > >> for improvements in the current implementation for reducing the
> > >> synchronization overhead but that is likely true for pd-gem5, too.
> > >> The important thing here is that the solution must provide correctness
> > >> (robustness) first.
> > >>
> > >pd-gem5 provides correctness. Please read my previous comment. The whole
> > >purpose of multi/pd-gem5 is to parallelize simulation with minimal overhead
> > >and gain speedup. If you fail to do so, nobody will use your tool.
> > >
> > >
> > >> >Also, multi-gem5 sends huge messages (multiHeaderPkt) through the network
> > >> >to perform each synchronization point, which increases synchronization
> > >> >overhead further. In pd-gem5, we choose to send just one character as the
> > >> >sync message through a separate socket to reduce synchronization overhead.
> > >>
> > >> The TCP/IP message size is unlikely to be the bottleneck here. Multi-gem5
> > >> will send ~50 bytes more in a sync barrier message than pd-gem5 but that
> > >> bigger sync message still fits into a single ethernet frame on the wire.
> > >> The end-to-end latency overhead that is caused by 50 bytes extra payload
> > >> for a small single frame TCP/IP message is likely to fall into the "noise"
> > >> category if one tries to measure it in a real cluster.
> > >>
> > >You should prove your hypothesis experimentally. Each gem5 process
> > >sends/receives sync messages at the end of every quantum. Say you are
> > >simulating an "N"-node computer cluster with "M" different configurations.
> > >Then you will have N*M gem5 processes that send/receive these 50 bytes (I
> > >think it's more) of extra data at the same time over the network ...
> > >
> > >Furthermore, multi-gem5 sends a header before each data message. By
> > >comparison, pd-gem5 just adds 12 bytes (each time-stamp is the 12 least
> > >significant digits of the Tick) to each data packet. I don't know exactly
> > >how large these "MultiHeaderPkt" are, but it has two Tick fields that are
> > >each 64 Bytes! Also, header packets are separate TCP packets, so you pay
> > >for sending two separate packets for each data packet. And worse, you
> > >serialize all of these with sync messages.
> > >
> > >
> > >> >
> > >> >* Packet handling.
> > >> >
> > >> >pd-gem5 uses EtherTap for data packets but changed the polling mechanism
> > >> >to go through the main event queue. Since this rate is actually linked
> > >> >with simulator progress, it cannot guarantee that the packets are
> > >> >serviced at regular intervals of real time. This can lead to packets
> > >> >queueing up which would contribute to the synchronization issues
> > >> >mentioned above.
> > >> >
> > >> >multi-gem5 uses plain sockets with separate receive threads and so does
> > >> >not have this issue.
> > >> >
> > >> >I think you are again pointing to your first concern, which I've
> > >> >explained above. Packets that have queued up in the EtherTap socket will
> > >> >be processed and delivered to the simulation environment at the
> > >> >beginning of the next simulation quantum.
> > >> >
> > >> >Please notice that multi-gem5 introduces new simObjects to interface the
> > >> >simulation environment to the real world, which is redundant. This
> > >> >functionality is already provided by EtherTap.
> > >>
> > >> Except that the EtherTap solution does not provide a correct (robust)
> > >> solution for the synchronization problem.
> > >>
> > >Please read my first/second comments.
> > >
> > >
> > >> >
> > >> >* Checkpoint accuracy.
> > >> >
> > >> >A user would like to have a checkpoint at precisely the time the
> > >> >'m5 checkpoint' operation is executed so as to not miss any of the
> > >> >area of interest in his application.
> > >> >
> > >> >pd-gem5 requires that simulation finish the current quantum
> > >> >before checkpointing, so it cannot provide this.
> > >> >
> > >> >(Shortening the quantum can help, but usually the snapshot is being taken
> > >> >while 'fast-forwarding', i.e. simulating as fast as possible, which would
> > >> >motivate a longer quantum.)
> > >> >
> > >> >multi-gem5 can enter the drain cycle immediately upon receiving a
> > >> >checkpoint request. We find this accuracy highly desirable.
> > >> >
> > >> >It's true that if you have a large quantum size then there would be some
> > >> >discrepancy between the m5_ckpt instruction tick and the actual dump tick.
> > >> >Based on the multi-gem5 code, my understanding is that you send an async
> > >> >checkpoint message as soon as one of the gem5 processes encounters an
> > >> >m5_ckpt instruction. But I'm not sure how you fix the aforementioned
> > >> >issue, because you have to sync all gem5 processes before you start
> > >> >dumping the checkpoint, which necessitates a global synchronization
> > >> >beforehand.
> > >>
> > >> In multi-gem5, the gem5 process that encounters the m5_ckpt instruction
> > >> sends out an async checkpoint notification to the peer gem5 processes and
> > >> then starts draining immediately (at the same tick). So the
> > >> checkpoint will be taken at the exact tick from the initiator process's
> > >> point of view. The global synchronisation with the peer processes takes
> > >> place while the initiator process is still waiting at the same tick (i.e.
> > >> the simulation thread is suspended). However, the receiver thread
> > >> continues reading out the socket while waiting for the global sync to
> > >> complete, to make sure that in-flight data packets from peer gem5
> > >> processes are stored properly and saved into the checkpoint.
> > >>
> > >>
> > >So you mean multi-gem5 ends up with having gem5 processes with different
> > >ticks after checkpoint? In pd-gem5 we make sure that all gem5 processes
> > >start dumping checkpoint at the same tick. Are you sure that this is
> > >correct to have each gem5 process dump checkpoint at different ticks???
> > >
> > >I don't think this is a correct checkpointing design. However, if you
> > >feel it is correct, I can change a couple of lines in "Simulation.py"
> > >and the barrier scripts to implement the same functionality in pd-gem5.
> > >One thing that you are obsessed about is making sure that there are no
> > >in-flight packets when we start dumping the checkpoint, and you have all
> > >these complex mechanisms in place to ensure that! I think you can be
> > >99.99999% sure that there is no in-flight packet by waiting for 1 second
> > >after all gem5 processes have finished their quantum simulation and then
> > >dumping the checkpoint. Do you really think that delivering a TCP packet
> > >would take more than 1 second in today's systems!?
> > >Always go for simple solutions ...
> > >
> > >
> > >
> > >> >
> > >> >By the way, we have a fix for this issue by introducing a new m5
> pseudo
> > >> >instruction.
> > >>
> > >> I fail to see how a new pseudo instruction can solve the problem of
> > >> completing the full quantum in pd-gem5 before a checkpoint can be
> taken.
> > >> Could you please elaborate on that?
> > >>
> > >As we take checkpoints while fast-forwarding, and it is likely that we
> > >relax synchronization for speedup purposes, a new pseudo instruction
> > >that can set the quantum size (m5_qset) can be helpful. One can insert
> > >m5_qset in the benchmark source code before entering the ROI that
> > >contains m5_ckpt to decrease the quantum size beforehand and reduce the
> > >discrepancy between the m5_ckpt tick and the actual checkpoint tick.
> > >This is not included in the pd-gem5 patch right now.
> > >
> > >
> > >> >
> > >> >* Implementation of network topology.
> > >> >
> > >> >pd-gem5 uses a separate gem5 process to act as a switch whereas
> > >>multi-gem5
> > >> >
> > >> >uses a standalone packet relay process.
> > >> >
> > >> >We haven't measured the overhead of pd-gem5's simulated switch yet,
> but
> > >> >
> > >> >we're confident that our approach is at least as fast and more
> > >>scalable.
> > >> >
> > >> >pd-gem5 has the flexibility to simulate a switch box alongside one of
> > >> >the other gem5 processes. However, that might make that gem5 process
> > >> >the simulation bottleneck. One of the advantages of pd-gem5 over
> > >> >multi-gem5 is that we use gem5 to simulate a switch box, which allows
> > >> >us to model any network topology by instantiating several Switch
> > >> >SimObjects and interconnecting them with EtherLink in an arbitrary
> > >> >fashion. A standalone TCP server can only provide switch functionality
> > >> >(forwarding packets to destinations) and model a star network
> > >> >topology. Furthermore, it cannot model various network timings such as
> > >> >queueing delay, congestion, and routing latency. It also has some
> > >> >accuracy issues that I will point out next.
> > >>
> > >> I agree with the complex topology argument. We already mentioned that
> > >> before as an advantage for pd-gem5 from the point of view of future
> > >> extensions. However, I do not agree that multi-gem5 cannot model
> > >>queueing
> > >> delays and congestions. For a simple crossbar switch, it can model
> > >>queueing
> > >> delays and congestions, but the receive queues are distributed among
> the
> > >> gem5 processes.
> > >>
> > >It's true that you can model the queueing delay of a simple crossbar by
> > >distributing queues across gem5 processes (end points). But to be able
> > >to do so you have to ensure the ordering of packets that you enqueue in
> > >the distributed queues. That is almost impossible without a synchronized
> > >switch box. You would need a reorder queue that reorders packets
> > >dynamically and updates the timing parameters of each packet as well. I
> > >don't know how much progress you have made on the ordering scheme in
> > >multi-gem5, but you may have already realized how complex and
> > >error-prone it can be. This argument is also related to my next
> > >argument, "Broken network timing".
> > >
> > >
> > >> >
> > >> >* Broken network timing:
> > >> >
> > >> >Forwarding packets between gem5 processes using a standalone TCP
> > >> >server can cause reordering between packets that have different
> > >> >sources but the same destination. This causes inaccurate network
> > >> >timing and, worst of all, non-deterministic simulation. pd-gem5
> > >> >resolves this by reordering packets at the Switch process and then
> > >> >sending them to their destination (this is possible because the
> > >> >switch is synchronized with the rest of the nodes).
> > >>
> > >> In multi-gem5, there is always a HeaderPkt that contains some meta
> > >> information for each data packet. The meta information includes the
> > >> send tick and the sender rank (i.e. a unique ID of the sender gem5
> > >> process). We use that information to define a well-defined ordering of
> > >> packets even if packets arrive at the same receiver from different
> > >> senders. This packet ordering scheme is still being tested, so the
> > >> corresponding patch is not on the RB yet.
> > >>
> > >> Please read my previous comment. The most important part of
> > >>multi/pd-gem5
> > >extension is ensuring accurate and deterministic simulation.
> > >
> > >
> > >> >
> > >> >* Amount of changes
> > >> >
> > >> >pd-gem5 introduces different modes in EtherLink just to provide
> > >> >accurate timing for each component in the network subsystem (NIC,
> > >> >link, switch) as well as the capability of modeling different network
> > >> >topologies (mesh, ring, fat tree, etc.). To enable a simple
> > >> >functionality, like what multi-gem5 provides, the amount of changes in
> > >> >gem5 can be limited to time-stamping packets and providing
> > >> >synchronization through Python scripts. However, multi-gem5
> > >> >re-implements functionality that is already in gem5.
> > >>
> > >> This argument holds only if both implementations are correct (robust).
> > >>It
> > >> still seems to me that pd-gem5 does not provide correctness for the
> > >> synchronization/checkpointing parts.
> > >>
> > >> Again, please read my first comment for correctness of pd-gem5.
> > >
> > >
> > >> >
> > >> >* Integrating with gem5 mainstream:
> > >> >
> > >> >pd-gem5 launch script is written in python which is suited for
> > >>integration
> > >> >with gem5 python scripts. However multi-gem5 uses bash script. Also,
> > >>all
> > >> >source files in pd-gem5 are already parts of gem5 mainstream. However
> > >> >multi-gem5 has tcp_server.cc/hh that is a standalone process and
> > cannot
> > >> be
> > >> >part of gem5.
> > >>
> > >> The multi-gem5 launch script is simple enough to rely only on the
> > >> shell. It can obviously be easily re-written in Python if that added
> > >> any value. The tcp_server component is only a utility (like the "m5"
> > >> utility that is also part of gem5).
> > >>
> > >The thing is that it's more likely that users will want to add some
> > >functionality to the run-script of multi/pd-gem5. E.g. the pd-gem5
> > >run-script supports launching simulations using simulation pool
> > >management software (http://research.cs.wisc.edu/htcondor/). Using
> > >Python enables users to easily add this kind of support.
> > >
> > >
> > >>
> > >> Cheers,
> > >> - Gabor
> > >>
> > >>
> > >> >On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham <
> ***@arm.com>
> > >> >wrote:
> > >> >
> > >> >>Hello everyone,
> > >> >>We have taken a look at how pd-gem5 compares with multi-gem5. While
> > >> >>intending
> > >> >>to deliver the same functionality, there are some crucial
> differences:
> > >> >>
> > >> >>* Synchronization.
> > >> >>
> > >> >> pd-gem5 implements this in Python (not a problem in itself;
> > >> >>aesthetically
> > >> >> this is nice, but...). The issue is that pd-gem5's data packets
> > >>and
> > >> >> barrier messages travel over different sockets. Since pd-gem5
> > >>could
> > >> >>see
> > >> >> data packets passing synchronization barriers, it could create
> an
> > >> >> inconsistent checkpoint.
> > >> >>
> > >> >> multi-gem5's synchronization is implemented in C++ using sync
> > >>events,
> > >> >>but
> > >> >> more importantly, the messages queue up in the same stream and
> so
> > >> >>cannot
> > >> >> have the issue just described. (Event ordering is often crucial
> > >>in
> > >> >> snapshot protocols.) Therefore we feel that multi-gem5 is a more
> > >> >>robust
> > >> >> solution in this respect.
> > >> >>
> > >> >>* Packet handling.
> > >> >>
> > >> >> pd-gem5 uses EtherTap for data packets but changed the polling
> > >> >>mechanism
> > >> >> to go through the main event queue. Since this rate is actually
> > >> >>linked
> > >> >> with simulator progress, it cannot guarantee that the packets
> are
> > >> >>serviced
> > >> >> at regular intervals of real time. This can lead to packets
> > >> >>queueing up
> > >> >> which would contribute to the synchronization issues mentioned
> > >>above.
> > >> >>
> > >> >> multi-gem5 uses plain sockets with separate receive threads and
> so
> > >> >>does
> > >> >>not
> > >> >> have this issue.
> > >> >>
> > >> >>* Checkpoint accuracy.
> > >> >>
> > >> >> A user would like to have a checkpoint at precisely the time the
> > >> >> 'm5 checkpoint' operation is executed so as to not miss any of
> the
> > >> >> area of interest in his application.
> > >> >>
> > >> >> pd-gem5 requires that simulation finish the current quantum
> > >> >> before checkpointing, so it cannot provide this.
> > >> >>
> > >> >> (Shortening the quantum can help, but usually the snapshot is
> being
> > >> >>taken
> > >> >> while 'fast-forwarding', i.e. simulating as fast as possible,
> which
> > >> >>would
> > >> >> motivate a longer quantum.)
> > >> >>
> > >> >> multi-gem5 can enter the drain cycle immediately upon receiving a
> > >> >> checkpoint request. We find this accuracy highly desirable.
> > >> >>
> > >> >>* Implementation of network topology.
> > >> >>
> > >> >> pd-gem5 uses a separate gem5 process to act as a switch whereas
> > >> >>multi-gem5
> > >> >> uses a standalone packet relay process.
> > >> >>
> > >> >> We haven't measured the overhead of pd-gem5's simulated switch
> yet,
> > >> >>but
> > >> >> we're confident that our approach is at least as fast and more
> > >> >>scalable.
> > >> >>
> > >> >>
> > >> >>Thanks,
> > >> >>Curtis
> > >> >>________________________________________
> > >> >>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad
> > >>Alian [
> > >> >>***@wisc.edu]
> > >> >>Sent: Friday, June 26, 2015 7:37 PM
> > >> >>To: gem5 Developer List
> > >> >>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> > >> >>system
> > >> >>on multiple physical hosts
> > >> >>
> > >> >>Hi Anthony,
> > >> >>
> > >> >>I think that would be a good option, then I can add pd-gem5
> > >> >>functionality
> > >> >>on top of that. Right now I've simplified your implementation.
> Also, I
> > >> >>think I had found some bugs in your patch that I cannot remember
> now.
> > >>If
> > >> >>you decided to ship EtherSwitch patch, let me know to give you a
> > >>review
> > >> >>on
> > >> >>that.
> > >> >>
> > >> >>Thanks,
> > >> >>Mohammad
> > >> >>
> > >> >>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
> > >> >>***@amd.com> wrote:
> > >> >>
> > >> >>>Would it make sense for me to ship the EtherSwitch patch first,
> since
> > >> >>it
> > >> >>>has utility on its own, and then we can decide which of the
> > >> >>"multi-gem5"
> > >> >>>approaches is best, or if it's some combination of both?
> > >> >>>
> > >> >>>The only reason I never shipped it was because Steve raised an
> issue
> > >> >>that
> > >> >>>I didn't have a good alternative for, and didn't have the time to
> > >>look
> > >> >>into
> > >> >>>one at that time.
> > >> >>>________________________________________
> > >> >>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
> > >> >>Alian [
> > >> >>>***@wisc.edu]
> > >> >>>Sent: Wednesday, June 24, 2015 12:43 PM
> > >> >>>To: gem5 Developer List
> > >> >>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> > >> >>system
> > >> >>>on multiple physical hosts
> > >> >>>
> > >> >>>Hi Andreas,
> > >> >>>
> > >> >>>Thanks for the comment.
> > >> >>>I think the checkpointing support in both works is the same. Here
> is
> > >> >>how
> > >> >>>checkpointing support is implemented in pd-gem5:
> > >> >>>
> > >> >>>Whenever one of the gem5 processes encounters an m5-checkpoint
> > >> >>>pseudo instruction, it sends a "recv-ckpt" signal to the "barrier"
> > >> >>>process. The "barrier" process then sends a "take-ckpt" signal to
> > >> >>>all the simulated nodes (including the node that encountered
> > >> >>>m5-checkpoint) at the end of the current simulation quantum. On
> > >> >>>reception of the "take-ckpt" signal, the gem5 processes start
> > >> >>>dumping checkpoints. This makes each simulated node dump a
> > >> >>>checkpoint at the same simulated time point while ensuring there are
> > >> >>>no in-flight packets.
> > >> >>>
> > >> >>>I believe this is the same as multi-gem5 patch approach for
> > >>checkpoint
> > >> >>>support (based on the commit message of
> > >> >>http://reviews.gem5.org/r/2865/
> > >> >>).
> > >> >>>Also, we have tested our mechanism with several benchmarks and it
> > >> >>works.
> > >> >>As
> > >> >>>Steve suggested, I'll look into Curtis's patch and try to review it
> > >>as
> > >> >>>well.
> > >> >>>But as Nilay also mentioned earlier, there are some codes missing
> in
> > >> >>>Curtis's patch. I prefer to first run multi-gem5 before starting to
> > >> >>review
> > >> >>>it.
> > >> >>>
> > >> >>>Thank you,
> > >> >>>Mohammad
> > >> >>>
> > >> >>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
> > >> >>***@arm.com>
> > >> >>>wrote:
> > >> >>>
> > >> >>>>Hi Steve,
> > >> >>>>
> > >> >>>>Apologies for the confusion. We are on the same page. My point is
> > >> >>that
> > >> >>we
> > >> >>>>cannot simply take a little bit of patch A and a little bit of
> > >> >>patch B.
> > >> >>>>This change involves a lot of code, and we need to approach this
> in
> > >> >>a
> > >> >>>>structured fashion. My proposal is to do it bottom up, and start
> by
> > >> >>>>getting the basic support in place. Since
> > >> >>>http://reviews.gem5.org/r/2826/
> > >> >>>>has already been on the review board for a few months, I am merely
> > >> >>>>suggesting that the it would be a good start to relate the newly
> > >> >>posted
> > >> >>>>patches to what is already there.
> > >> >>>>
> > >> >>>>Andreas
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> > >> >>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
> > >> >>>>
> > >> >>>>>Hi Andreas,
> > >> >>>>>
> > >> >>>>>I'm a little confused by your email---you say you're
> fundamentally
> > >> >>>opposed
> > >> >>>>>to looking at both patches and picking the best features, then
> you
> > >> >>point
> > >> >>>>>out that the patches Curtis posted have the feature of better
> > >> >>>>>checkpointing
> > >> >>>>>support so we should pick that :).
> > >> >>>>>
> > >> >>>>>Obviously we can't just pick patch A from Mohammad's set and
> patch
> > >> >>B
> > >> >>>from
> > >> >>>>>Curtis's set and expect them to work together, but I think that
> > >> >>having
> > >> >>>>>both
> > >> >>>>>sets of patches available and comparing and contrasting the two
> > >> >>>>>implementations should enable us to get to a single
> implementation
> > >> >>>that's
> > >> >>>>>the best of both. Someone will have to make the effort of
> > >> >>integrating
> > >> >>>the
> > >> >>>>>better ideas from one set into the other set to create a new
> > >> >>unified
> > >> >>set
> > >> >>>>>of
> > >> >>>>>patches; (or maybe we commit one set and then integrate the best
> of
> > >> >>the
> > >> >>>>>other set as patches on top of that), but the first step is to
> > >> >>identify
> > >> >>>>>what "the best of both" is. Having Mohammad look at Curtis's
> > >> >>patches,
> > >> >>>and
> > >> >>>>>Curtis (or someone else from ARM) closely examine Mohammad's
> > >> >>patches
> > >> >>>would
> > >> >>>>>be a great start. I intend to review them both, though
> > >> >>unfortunately
> > >> >>my
> > >> >>>>>time has been scarce lately---I'm hoping to squeeze that in later
> > >> >>this
> > >> >>>>>week.
> > >> >>>>>
> > >> >>>>>Once we've had a few people look at both, we can discuss the pros
> > >> >>and
> > >> >>>cons
> > >> >>>>>of each, then discuss the strategy for getting the best features
> > >> >>in.
> > >> >>So
> > >> >>>>>far I've heard that Mohammad's patches have a better network
> model
> > >> >>but
> > >> >>>the
> > >> >>>>>ARM patches have better checkpointing support; that seems like a
> > >> >>good
> > >> >>>>>start.
> > >> >>>>>
> > >> >>>>>Steve
> > >> >>>>>
> > >> >>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> > >> >>>***@arm.com
> > >> >>>>>
> > >> >>>>>wrote:
> > >> >>>>>
> > >> >>>>>>Hi all,
> > >> >>>>>>
> > >> >>>>>>Great work. However, I fundamentally do not believe in the
> > >> >>>>>>approach of 'letting reviewers pick the best features'. There is
> > >> >>>>>>no way we would ever get something working out of it. We need to
> > >> >>>>>>get _one_ working solution here, and figure out how to best get
> > >> >>>>>>there. I would propose to do it bottom up, starting with the basic
> > >> >>>>>>multi-simulator instance support, checkpointing support, and then
> > >> >>>>>>move on to the network between the simulator instances.
> > >> >>>>>>
> > >> >>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
> > >> >>>support
> > >> >>>>>>from what Curtis has posted. I believe proper checkpointing
> > >> >>support
> > >> >>to
> > >> >>>>>>be
> > >> >>>>>>the most challenging, and from what I can tell this is far more
> > >> >>>limited
> > >> >>>>>>in
> > >> >>>>>>what you just posted Mohammad. Could you perhaps review Curtis
> > >> >>patches
> > >> >>>>>>based on your insights, and we can try and get these patches in
> > >> >>shape
> > >> >>>>>>and
> > >> >>>>>>committed asap.
> > >> >>>>>>
> > >> >>>>>>Once we have the baseline functionality in place, then we can
> > >> >>start
> > >> >>>>>>looking at the more elaborate network models.
> > >> >>>>>>
> > >> >>>>>>Does this sound reasonable?
> > >> >>>>>>
> > >> >>>>>>Thanks,
> > >> >>>>>>
> > >> >>>>>>Andreas
> > >> >>>>>>
> > >> >>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> > >> >>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> > >> >>>>>>
> > >> >>>>>>>Hello All,
> > >> >>>>>>>
> > >> >>>>>>>I have submitted a chain of patches which enables gem5 to
> > >> >>simulate
> > >> >>a
> > >> >>>>>>>cluster on multiple physical hosts:
> > >> >>>>>>>
> > >> >>>>>>>http://reviews.gem5.org/r/2909/
> > >> >>>>>>>http://reviews.gem5.org/r/2910/
> > >> >>>>>>>http://reviews.gem5.org/r/2912/
> > >> >>>>>>>http://reviews.gem5.org/r/2913/
> > >> >>>>>>>http://reviews.gem5.org/r/2914/
> > >> >><http://reviews.gem5.org/r/2914/>
> > >> >>>>>>>
> > >> >>>>>>>and a patch that contains run scripts for a simple experiment:
> > >> >>>>>>>http://reviews.gem5.org/r/2915/
> > >> >>>>>>>
> > >> >>>>>>>We have run several benchmarks using this infrastructure,
> > >> >>including
> > >> >>>NAS
> > >> >>>>>>>parallel benchmarks (MPI) and DCBench-hadoop
> > >> >>>>>>>(http://prof.ict.ac.cn/DCBench/),
> > >> >>>>>>>and would be happy to share scripts/diskimages.
> > >> >>>>>>>
> > >> >>>>>>>We call this *pd-gem5*. *pd-gem5 *functionality is more or less
> > >> >>the
> > >> >>>>>>same
> > >> >>>>>>>as
> > >> >>>>>>>Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
> > >> >>*network
> > >> >>>>>>model
> > >> >>>>>>>is
> > >> >>>>>>>more thorough; it also enables modeling different network
> > >> >>topologies.
> > >> >>>>>>>Having both set of changes together let reviewers to pick best
> > >> >>>features
> > >> >>>>>>>from both works.
> > >> >>>>>>>
> > >> >>>>>>>Thank you,
> > >> >>>>>>>Mohammad Alian
> > >> >>>>>>>_______________________________________________
> > >> >>>>>>>gem5-dev mailing list
> > >> >>>>>>>gem5-***@gem5.org
> > >> >>>>>>>http://m5sim.org/mailman/listinfo/gem5-dev
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>>-- IMPORTANT NOTICE: The contents of this email and any
> > >> >>attachments
> > >> >>>are
> > >> >>>>>>confidential and may also be privileged. If you are not the
> > >> >>intended
> > >> >>>>>>recipient, please notify the sender immediately and do not
> > >> >>disclose
> > >> >>>the
> > >> >>>>>>contents to any other person, use it for any purpose, or store
> or
> > >> >>copy
> > >> >>>>>>the
> > >> >>>>>>information in any medium. Thank you.
> > >> >>>>>>
> > >> >>>>>>ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1
> > >> >>9NJ,
> > >> >>>>>>Registered in England & Wales, Company No: 2557590
> > >> >>>>>>ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge
> > >> >>CB1
> > >> >>>>>>9NJ,
> > >> >>>>>>Registered in England & Wales, Company No: 2548782
Gabor Dozsa
2015-07-03 14:10:53 UTC
Permalink
Hi all,

Thank you Steve for the thorough review.

First, let me elaborate a bit on Andreas’s 3rd point about non-synchronous
checkpoints. Let’s assume that we aim to simulate MPI applications (HPC
workloads). The ROI in an MPI application typically starts with a
global MPI_Barrier() call. We want to take the checkpoint when *every*
gem5 process has reached that MPI_Barrier() in the simulated code, but that
may not happen at the same tick in each gem5 process (due to load imbalance
among the simulated nodes). That’s why multi-gem5 implements the
non-synchronous checkpoint support.

My answers to your questions are as follows.

1. The only change necessary to use multi-gem5 with a non-Ethernet
(simulated) network is to replace the Ethernet packet type with another
packet type in MultiIface.
In fact, the first implementation of MultiIface was a template
that took EthPacketData as a parameter because I planned to support
different network types. When I realized that currently only Ethernet is
supported by gem5, I dropped the template parameter to keep the
implementation simpler. I have also realized in the meantime that the right
approach would probably be to create a pure virtual ‘base’ class for network
packets from which Ethernet (and other types of) packets could be derived.
Then MultiIface could simply use that base class to provide support for
different network types. The interface provided by the base packet class
could be very simple. Besides the total size() of the packet, multi-gem5
only needs a method to ‘extract’ the source/destination address. Those
addresses are used in MultiIface as opaque byte arrays, so they are already
quite network-type agnostic.
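
To make the idea concrete, here is a rough sketch of such a base packet
class. It is written in Python only for brevity (the real code would be
C++), and NetPacket, EthPacket, route_key() and the src_addr()/dst_addr()
method names are made up for this example; they are not part of gem5 or of
the multi-gem5 patches.

```python
from abc import ABC, abstractmethod


class NetPacket(ABC):
    """Hypothetical network-agnostic packet interface (not a gem5 class)."""

    @abstractmethod
    def size(self) -> int:
        """Total packet size in bytes."""

    @abstractmethod
    def src_addr(self) -> bytes:
        """Source address as an opaque byte string."""

    @abstractmethod
    def dst_addr(self) -> bytes:
        """Destination address as an opaque byte string."""


class EthPacket(NetPacket):
    """Ethernet flavour: 6-byte destination MAC, then 6-byte source MAC."""

    def __init__(self, frame: bytes):
        if len(frame) < 14:
            raise ValueError("Ethernet frame shorter than its 14-byte header")
        self.frame = frame

    def size(self) -> int:
        return len(self.frame)

    def dst_addr(self) -> bytes:
        return self.frame[0:6]

    def src_addr(self) -> bytes:
        return self.frame[6:12]


def route_key(pkt: NetPacket) -> bytes:
    """All a MultiIface-like layer needs for forwarding: the opaque
    destination address, whatever the underlying network type is."""
    return pkt.dst_addr()
```

A packet type for another network would only have to implement the same
three methods; the forwarding code would never look inside the addresses.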

2. That’s right, we have designed the MultiIface/TCPIface split with
different underlying messaging systems in mind.

3. Multi-gem5 can work together with multi-threaded/multi-event-queue gem5
configs. The current TCPIface/tcp_server components would still use
sockets to send around the packets. So it is possible to put together a
multi-gem5 simulation where each gem5 process has multiple event queues
(and an independent simulation thread per event queue) but all the
simulated Ethernet links would use sockets to forward every Ethernet
packet to the tcp_server.

If someone wanted to run only a single gem5 process to simulate an entire
cluster (using one thread/event-queue per cluster node), then the current
multi-gem5 implementation using sockets/tcp_server would not be optimal. In
that case, a better solution would be to provide a shared-memory-based
implementation of the MultiIface virtual communication methods
sendRaw()/recvRaw()/syncRaw() (i.e. a shared memory equivalent of
TCPIface). In that implementation, the entire discrete tcp_server component
could be replaced with a shared data structure.

4. You are right, the current implementation does not make it possible to
construct an equivalent single-process simulation model for a multi-gem5
run. However, a possible solution is a shared memory based implementation
of the MultiIface virtual communication methods just as I described in the
previous paragraph. The same implementation could then work with both
multi-threaded/multi-event-queues and single-thread/single-event-queue
gem5 configs.
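
As a rough illustration of the shared-memory idea from points 3 and 4, the
sketch below replaces the socket/tcp_server path with per-node queues and a
barrier inside one process. It is Python rather than C++, and
SharedMemIface, send_raw()/recv_raw()/sync_raw() and run_quantum() are
hypothetical names chosen to mirror sendRaw()/recvRaw()/syncRaw(); none of
this is the actual multi-gem5 code. Sorting received messages by
(send tick, sender rank) is an assumption made here to keep the result
independent of thread scheduling.

```python
import queue
import threading


class SharedMemIface:
    """Hypothetical in-process stand-in for a TCPIface-like transport."""

    def __init__(self, rank, mailboxes, barrier):
        self.rank = rank            # unique ID of this simulated node
        self.mailboxes = mailboxes  # shared: one queue.Queue per rank
        self.barrier = barrier      # shared: threading.Barrier over all ranks

    def send_raw(self, dest_rank, send_tick, payload):
        # Tag each message with its send tick and the sender rank.
        self.mailboxes[dest_rank].put((send_tick, self.rank, payload))

    def recv_raw(self):
        # Drain everything queued for this rank, then order it by
        # (send tick, sender rank) so the order is deterministic.
        msgs = []
        while True:
            try:
                msgs.append(self.mailboxes[self.rank].get_nowait())
            except queue.Empty:
                break
        return sorted(msgs, key=lambda m: (m[0], m[1]))

    def sync_raw(self):
        # End-of-quantum barrier: returns once every rank has arrived.
        self.barrier.wait()


def run_quantum(num_nodes=3):
    """Run one quantum in which every node sends one packet to node 0."""
    mailboxes = [queue.Queue() for _ in range(num_nodes)]
    barrier = threading.Barrier(num_nodes)
    received = {}

    def node(rank):
        iface = SharedMemIface(rank, mailboxes, barrier)
        iface.send_raw(0, send_tick=100 - rank, payload=b"from %d" % rank)
        iface.sync_raw()  # after this point all sends are visible
        received[rank] = iface.recv_raw()

    threads = [threading.Thread(target=node, args=(r,))
               for r in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because the transport is just a shared data structure, the same simulation
code could run with one thread per node or, with the barrier degenerating to
a no-op, on a single thread.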

Thanks,
- Gabor

On 7/2/15, 7:20 PM, "Steve Reinhardt" <***@gmail.com> wrote:

>Hi everyone,
>
>Sorry for taking so long to engage. This is a great development and I
>think
>both these patches are terrific contributions. Thanks to Mohammad, Gabor,
>and everyone else involved.
>
>I agree with Andreas that we should start with some top-level goals &
>assumptions, agree on those, and then we can sort out the detailed issues
>based on a consistent view.
>
>I definitely agree with Andreas's first two points. The third one seems a
>little surprising; I'd like to hear more about the motivation before
>expressing an opinion. I can see where non-synchronous checkpointing could
>be useful, but it's also clear from the associated patch that it's not
>trivial to implement either. How much would be lost by requiring a
>synchronization before a checkpoint?
>
>From my personal perspective, I would like to see whatever we do here be a
>first step toward a more general distributed simulation platform. Both of
>these patches seem pretty Ethernet-centric in different ways. This is not
>terrible; part of the problem is that gem5's current internal networking
>support is already overly Ethernet-centric IMO. But it would be nice to
>avoid baking that in even further. Rather than assume I have understood
>all
>the code completely, I'll phrase things in the form of questions, and
>people can comment on how those questions would be answered in the context
>of the two different approaches.
>
>1. How much effort would be required to simulate a non-Ethernet network?
>My
>impression is that pd-gem5 has a leg up here, since a gem5 switch model
>for
>a non-Ethernet network (which you'd have to write anyway if you were
>simulating a different network) could be used in place of the current
>Ethernet switch, where for multi-gem5 I think that the
>util/multi/tcp_server.cc code would have to be modified (i.e., there'd be
>additional work above and beyond what you'd need to get the network
>modeled
>in base gem5).
>
>2. How much effort is required to run on a non-Ethernet network (or
>equivalently using a non-sockets API)? The MultiIface/TCPIface split in
>the multi-gem5 code looks like it addresses this nicely, but pd-gem5 seems
>pretty tied to an Ethernet host fabric.
>
>3. Do both of these patches work with the existing multithreaded
>multiple-event-queue simulation? I think multi-gem5 does (though it would
>be nice to have a confirmation), but it's not clear about pd-gem5. I don't
>see a benefit to having multiple gem5 processes on a single host vs. a
>single multithreaded gem5 process using the existing support. I think this
>could be particularly valuable with a hierarchical network; e.g., maybe I
>would want to model a rack in multithreaded mode on a single multicore
>server, then use pd-gem5 or multi-gem5 to build up a simulation of
>multiple
>racks. Would this work out of the box with either of these patches, and if
>not, what would need to be done?
>
>4. Is it possible to construct a single-process simulation model that's
>identical to the distributed simulation? It would be very valuable for
>verification to be able to take a single simulation run and do it both
>within a single process and also across multiple processes and verify that
>identical results are achieved. This seems like a big drawback to the
>multi-gem5 tcp_server approach, IMO.
>
>I'm definitely not saying that all these issues need to be resolved before
>anything gets committed, but if we can agree that these are valid goals,
>then we can evaluate detailed issues based on whether they move us toward
>or away from those goals.
>
>Thanks,
>
>Steve
>
>
>On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson <***@arm.com>
>wrote:
>
>>Hi all,
>>
>>I think we need to up-level this a bit. From our perspective (and I
>>suspect in general):
>>
>>1. Robustness is important. Having a design that _may_ break, however
>>unlikely, is simply not an option.
>>
>>2. Performance and scaling are important. We can compare actual numbers
>>here, and I am fairly sure the two solutions are on par. Let’s quantify
>>that though.
>>
>>3. Checkpointing must not rely on synchronicity. It is vital for several
>>workloads that we can checkpoint the various gem5 instances at different
>>Ticks (due to the way the workloads are constructed).
>>
>>Andreas
>>
>>On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
>>
>>>Thanks Gabor for the reply.
>>>
>>>I feel this conversation is useful as we can find out pros/cons of each
>>>design.
>>>Please find my response in-lined below.
>>>
>>>Thank you,
>>>Mohammad
>>>
>>>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com>
>>wrote:
>>>
>>>> Hi All,
>>>>
>>>> Sorry for the missing indentation in my previous e-mail! (This was my
>>>> first e-mail to the dev-list so I could not simply use "reply".) Below
>>>> is the same message, hopefully in more readable form.
>>>>
>>>> ====================================
>>>>
>>>> Hi All,
>>>>
>>>> Thank you Mohammad for your elaboration on the issues!
>>>>
>>>> I have written most of the multi-gem5 patch so let me add some more
>>>> clarifications and respond to your concerns. My comments are inline
>>>> below.
>>>>
>>>> Thanks,
>>>> - Gabor
>>>>
>>>> On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>>>>
>>>> >Hi All,
>>>> >
>>>> >Curtis-Thank you for listing some of the differences. I was waiting for
>>>> >the completed multi-gem5 patch before I sent my review. Please see my
>>>> >inline response below. I've addressed the concerns that you've raised.
>>>> >Also, I've added a bit more to the comparison.
>>>> >
>>>> >-* Synchronization.
>>>> >
>>>> >pd-gem5 implements this in Python (not a problem in itself;
>>>>aesthetically
>>>> >
>>>> >this is nice, but...). The issue is that pd-gem5's data packets and
>>>> >
>>>> >barrier messages travel over different sockets. Since pd-gem5 could
>>>>see
>>>> >
>>>> >data packets passing synchronization barriers, it could create an
>>>> >
>>>> >inconsistent checkpoint.
>>>> >
>>>> >multi-gem5's synchronization is implemented in C++ using sync
>>events,
>>>>but
>>>> >
>>>> >more importantly, the messages queue up in the same stream and so
>>>>cannot
>>>> >
>>>> >have the issue just described. (Event ordering is often crucial in
>>>> >
>>>> >snapshot protocols.) Therefore we feel that multi-gem5 is a more
>>robust
>>>> >
>>>> >solution in this respect.
>>>> >
>>>> >Each packet in pd-gem5 has a time-stamp. So even if data packets pass
>>>> >synchronization barriers (in other words, data packets arrive early at
>>>> >the destination node), the destination node processes packets based on
>>>> >their timestamps. Actually, allowing data packets to pass sync barriers
>>>> >is a nice feature that can reduce the likelihood of late packet
>>>> >reception. The ordering of data messages that flow over pd-gem5 nodes is
>>>> >also preserved in the pd-gem5 implementation.
>>>>
>>>> This seems to be a misunderstanding. Maybe the wording was not precise
>>>> before. The problem is not a data packet "passing" a sync barrier but
>>>> the other way around: a sync barrier that can pass a data packet
>>>> (e.g. while the data packet is waiting in the host operating system
>>>> socket layer). If that happens, the packet will arrive later than it
>>>> was supposed to and it may miss the computed receive tick.
>>>>
>>>> For instance, let's assume that the quantum coincides with the simulated
>>>> Ether link delay. (This is the optimal choice of quantum to minimize the
>>>> number of sync barriers.) If a data packet is sent right at the beginning
>>>> of a quantum, then this packet must arrive at the destination gem5
>>>> process within the same quantum in order not to miss its receive tick at
>>>> the very beginning of the next quantum. If the sync barrier can pass the
>>>> data packet, then the data packet may arrive only during the next quantum
>>>> (or in extreme conditions even later than that), so by the time it
>>>> arrives the receiver gem5 may already have passed the receive tick.
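
The timing argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch, assuming ticks are plain integers and quanta are numbered from zero; the function names are hypothetical and do not come from either patch.

```python
def receive_tick(send_tick, link_delay):
    """Simulated tick at which the receiver must see the packet."""
    return send_tick + link_delay


def quantum_index(tick, quantum):
    """Index of the quantum that contains the given tick."""
    return tick // quantum


def may_miss_receive_tick(send_tick, link_delay, quantum, arrival_quantum):
    """Conservative late-arrival check.

    arrival_quantum is the index of the quantum the receiver process is
    simulating when the packet is physically delivered to it. The packet
    is only guaranteed to be on time if its receive tick lies in a later
    quantum than that one.
    """
    due = quantum_index(receive_tick(send_tick, link_delay), quantum)
    return due <= arrival_quantum


# Quantum equal to the link delay, packet sent at the start of quantum 0.
QUANTUM = LINK_DELAY = 1000
on_time = may_miss_receive_tick(0, LINK_DELAY, QUANTUM, arrival_quantum=0)
late = may_miss_receive_tick(0, LINK_DELAY, QUANTUM, arrival_quantum=1)
```

With these numbers the receive tick is 1000, the first tick of quantum 1, so delivery during quantum 0 is safe while delivery during quantum 1 may already be too late.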
>>>>
>>>This argument makes more sense than the previous one. Note that gem5 is a
>>>cycle-accurate simulator and it runs orders of magnitude slower than real
>>>hardware. So it's almost impossible for the flight time of a packet
>>>through the real network to be more than the simulation time of one
>>>quantum. We ran a set of experiments just for this purpose: with the
>>>quantum size equal to the etherlink delay, we never got any late arrival
>>>violation (what you described) for the full NAS benchmark suite (please
>>>refer to the paper).
>>>
>>>multi-gem5 is optimized for a case that almost never happens, and is
>>>sacrificing speedup for no gain.
>>>
>>>
>>>> Time-stamping does not help with this issue. Also, if a data packet is
>>>> waiting in the host operating system socket layer when the simulation
>>>> thread exits to Python to complete the next sync barrier, then the packet
>>>> will not go into the checkpoint that may follow that sync barrier.
>>>>
>>>That's a good point. The current pd-gem5 checkpointing mechanism might
>>>miss packets that were sent during the previous quantum and are waiting in
>>>the OS socket buffer. I should add some code inside the ethertap
>>>serialization function to drain the ethertap socket before writing a
>>>checkpoint. I will update the pd-gem5 patch accordingly.
>>>
>>>>
>>>> >What you mentioned as an advantage for multi-gem5 is actually a key
>>>> >disadvantage: buffering sync messages behind data packets can add to
>>>> >the synchronization overhead and slow down simulation significantly.
>>>>
>>>> The purpose of sync messages is to make sure that the data packets
>>>>arrive
>>>> in time (in terms of simulated time) at the destination so they can
>>be
>>>> scheduled for being received at the proper computed tick. Sync
>>messages
>>>> also make sure that no data packets are in flight when a sync barrier
>>>> completes before we take a checkpoint. They definitely add overhead
>>for
>>>> the simulation but they are necessary for the correctness of the
>>>> simulation.
>>>>
>>>> The receive thread in multi-gem5 reads out packets from the socket in
>>>> parallel with the simulation thread, so packets normally will not be
>>>> "queueing up" before a sync barrier message. There is definitely room
>>>> for improvement in the current implementation for reducing the
>>>> synchronization overhead, but that is likely true for pd-gem5, too.
>>>> The important thing here is that the solution must provide correctness
>>>> (robustness) first.
>>>>
>>>pd-gem5 provides correctness. Please read my previous comment. The whole
>>>purpose of multi/pd-gem5 is to parallelize simulation with minimal
>>>overhead and gain speedup. If you fail to do so, nobody will use your tool.
>>>
>>>
>>>> >Also,
>>>> >multi-gem5 sends huge messages (multiHeaderPkt) through the network to
>>>> >perform each synchronization point, which increases synchronization
>>>> >overhead further. In pd-gem5, we chose to send just one character as the
>>>> >sync message through a separate socket to reduce synchronization
>>>> >overhead.
>>>>
>>>> The TCP/IP message size is unlikely to be the bottleneck here. multi-gem5
>>>> will send ~50 bytes more in a sync barrier message than pd-gem5, but that
>>>> bigger sync message still fits into a single Ethernet frame on the wire.
>>>> The end-to-end latency overhead that is caused by 50 bytes of extra
>>>> payload for a small single-frame TCP/IP message is likely to fall into
>>>> the "noise" category if one tries to measure it in a real cluster.
>>>>
>>>You should prove your hypothesis experimentally. Each gem5 process
>>>sends/receives sync messages at the end of every quantum. Say you are
>>>simulating an "N"-node computer cluster with "M" different configurations.
>>>Then you will have N*M gem5 processes that send/receive these 50 bytes (I
>>>think it's more) of extra data at the same time over the network ...
>>>
>>>Furthermore, multi-gem5 sends a header before each data message. By
>>>comparison, pd-gem5 just adds 12 bytes (each time-stamp is the 12 least
>>>significant digits of the Tick) to each data packet. I don't know exactly
>>>how large these "MultiHeaderPkt" are, but each has at least two Tick
>>>fields of 64 bits each! Also, header packets are separate TCP packets, so
>>>you pay for sending two separate packets for each data packet. And worst,
>>>you serialize all of these with sync messages.
>>>
>>>
>>>> >
>>>> >* Packet handling.
>>>> >
>>>> >pd-gem5 uses EtherTap for data packets but changed the polling
>>>>mechanism
>>>> >
>>>> >to go through the main event queue. Since this rate is actually
>>linked
>>>> >
>>>> >with simulator progress, it cannot guarantee that the packets are
>>>> >serviced
>>>> >
>>>> >at regular intervals of real time. This can lead to packets
>>queueing
>>>>up
>>>> >
>>>> >which would contribute to the synchronization issues mentioned
>>above.
>>>> >
>>>> >multi-gem5 uses plain sockets with separate receive threads and so
>>does
>>>> >not
>>>> >
>>>> >have this issue.
>>>> >
>>>> >I think again you are pointing to your first concern that I've
>>>> >explained above. Packets that have queued up in the EtherTap socket
>>>> >will be processed and delivered to the simulation environment at the
>>>> >beginning of the next simulation quantum.
>>>> >
>>>> >Please notice that multi-gem5 introduces new simObjects to interface
>>>> >the simulation environment to the real world, which is redundant. This
>>>> >functionality is already provided by EtherTap.
>>>>
>>>> Except that the EtherTap solution does not provide a correct (robust)
>>>> solution for the synchronization problem.
>>>>
>>>Please read my first/second comments.
>>>
>>>
>>>> >
>>>> >* Checkpoint accuracy.
>>>> >
>>>> >A user would like to have a checkpoint at precisely the time the
>>>> >
>>>> >'m5 checkpoint' operation is executed so as to not miss any of the
>>>> >
>>>> >area of interest in his application.
>>>> >
>>>> >pd-gem5 requires that simulation finish the current quantum
>>>> >
>>>> >before checkpointing, so it cannot provide this.
>>>> >
>>>> >(Shortening the quantum can help, but usually the snapshot is being
>>>>taken
>>>> >
>>>> >while 'fast-forwarding', i.e. simulating as fast as possible, which
>>>>would
>>>> >
>>>> >motivate a longer quantum.)
>>>> >
>>>> >multi-gem5 can enter the drain cycle immediately upon receiving a
>>>> >
>>>> >checkpoint request. We find this accuracy highly desirable.
>>>> >
>>>> >It's true that if you have a large quantum size then there would be some
>>>> >discrepancy between the m5_ckpt instruction tick and the actual dump
>>>> >tick. Based on the multi-gem5 code, my understanding is that you send an
>>>> >async checkpoint message as soon as one of the gem5 processes encounters
>>>> >an m5_ckpt instruction. But I'm not sure how you fix the aforementioned
>>>> >issue, because you have to sync all gem5 processes before you start
>>>> >dumping the checkpoint, which necessitates a global synchronization
>>>> >beforehand.
>>>>
>>>> In multi-gem5, the gem5 process that encounters the m5_ckpt instruction
>>>> sends out an async checkpoint notification to the peer gem5 processes
>>>> and then starts draining immediately (at the same tick). So the
>>>> checkpoint will be taken at the exact tick from the initiator process's
>>>> point of view. The global synchronisation with the peer processes takes
>>>> place while the initiator process is still waiting at the same tick
>>>> (i.e. the simulation thread is suspended). However, the receiver thread
>>>> continues reading out the socket while waiting for the global sync to
>>>> complete, to make sure that in-flight data packets from peer gem5
>>>> processes are stored properly and saved into the checkpoint.
>>>>
>>>>
>>>So you mean multi-gem5 ends up with gem5 processes at different ticks
>>>after a checkpoint? In pd-gem5 we make sure that all gem5 processes start
>>>dumping the checkpoint at the same tick. Are you sure that it is correct
>>>to have each gem5 process dump its checkpoint at a different tick?
>>>
>>>I don't think this is a correct checkpointing design. However, if you
>>>feel it is correct, I can change a couple of lines in "Simulation.py" and
>>>the barrier scripts to implement the same functionality in pd-gem5. One
>>>thing that you are obsessed about is making sure that there are no
>>>in-flight packets when we start dumping a checkpoint, and you have all
>>>these complex mechanisms in place to ensure that! I think you can be
>>>99.99999% sure that there is no in-flight packet by waiting for 1 second
>>>after all gem5 processes have finished their quantum simulation and then
>>>dumping the checkpoint. Do you really think that delivering a TCP packet
>>>would take more than 1 second in today's systems!?
>>>Always go for simple solutions ...
>>>
>>>
>>>
>>>> >
>>>> >By the way, we have a fix for this issue by introducing a new m5
>>pseudo
>>>> >instruction.
>>>>
>>>> I fail to see how a new pseudo instruction can solve the problem of
>>>> completing the full quantum in pd-gem5 before a checkpoint can be
>>taken.
>>>> Could you please elaborate on that?
>>>>
>>>As we take checkpoints while fast-forwarding, and it is likely that we
>>>relax synchronization for speedup purposes, a new pseudo instruction that
>>>can set the quantum size (m5_qset) can be helpful. One can insert m5_qset
>>>in the benchmark source code before entering the ROI that contains m5_ckpt
>>>to decrease the quantum size beforehand and reduce the discrepancy between
>>>the m5_ckpt tick and the actual checkpoint tick. This is not included in
>>>the pd-gem5 patch right now.
>>>
>>>
>>>> >
>>>> >* Implementation of network topology.
>>>> >
>>>> >pd-gem5 uses a separate gem5 process to act as a switch whereas
>>>>multi-gem5
>>>> >
>>>> >uses a standalone packet relay process.
>>>> >
>>>> >We haven't measured the overhead of pd-gem5's simulated switch yet,
>>but
>>>> >
>>>> >we're confident that our approach is at least as fast and more
>>>>scalable.
>>>> >
>>>> >There is this flexibility in pd-gem5 to simulate a switch box alongside
>>>> >one of the other gem5 processes. However, it might make that gem5
>>>> >process the simulation bottleneck. One of the advantages of pd-gem5 over
>>>> >multi-gem5 is that we use gem5 to simulate a switch box, which allows us
>>>> >to model any network topology by instantiating several Switch simObjects
>>>> >and interconnecting them with EtherLink in an arbitrary fashion. A
>>>> >standalone tcp server can only provide switch functionality (forwarding
>>>> >packets to destinations) and model a star network topology.
>>>> >Furthermore, it cannot model various network timings such as queueing
>>>> >delay, congestion, and routing latency. It also has some accuracy issues
>>>> >that I will point out next.
>>>>
>>>> I agree with the complex topology argument. We already mentioned that
>>>> before as an advantage for pd-gem5 from the point of view of future
>>>> extensions. However, I do not agree that multi-gem5 cannot model
>>>> queueing delays and congestion. For a simple crossbar switch, it can
>>>> model queueing delays and congestion, but the receive queues are
>>>> distributed among the gem5 processes.
>>>>
>>>It's true that you can model the queuing delay of a simple crossbar by
>>>distributing queues across gem5 processes (end points). But to be able to
>>>do so you have to ensure the ordering of packets that you enqueue in the
>>>distributed queues. That is almost impossible without a synchronized
>>>switch box. You would need a reorder queue that reorders packets
>>>dynamically and updates the timing parameters for each packet as well. I
>>>don't know how much progress you have made on the ordering scheme in
>>>multi-gem5, but you may already have realized how complex and error-prone
>>>it can be. This argument is also related to my next argument, "Broken
>>>network timing".
>>>
>>>
>>>> >
>>>> >* Broken network timing:
>>>> >
>>>> >Forwarding packets between gem5 processes using a standalone tcp server
>>>> >can cause reordering between packets that have different sources but the
>>>> >same destination. It causes inaccurate network timing and, worst of all,
>>>> >non-deterministic simulation. pd-gem5 resolves this by reordering
>>>> >packets at the Switch process and then sending them to their
>>>> >destination (it's possible as the switch is synchronized with the rest
>>>> >of the nodes).
>>>>
>>>> In multi-gem5, there is always a HeaderPkt that contains some meta
>>>> information for each data packet. The meta information includes the send
>>>> tick and the sender rank (i.e. a unique ID of the sender gem5 process).
>>>> We use that information to define a well-defined ordering of packets even
>>>> if packets are arriving at the same receiver from different senders.
>>>> This packet ordering scheme is still being tested so the corresponding
>>>> patch is not on the RB yet.
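
The ordering patch referred to above was not posted at this point, so the following is only one plausible reading of the description: sort packets bound for the same receiver by send tick, breaking ties with the sender rank. `Header` and `deterministic_order` are hypothetical names, and the sketch assumes a sender never emits two packets at the same tick.

```python
from typing import List, NamedTuple


class Header(NamedTuple):
    send_tick: int    # simulated tick at which the packet was sent
    sender_rank: int  # unique ID of the sending gem5 process
    payload: bytes


def deterministic_order(arrivals: List[Header]) -> List[Header]:
    """Order packets independently of their host-level arrival order."""
    return sorted(arrivals, key=lambda h: (h.send_tick, h.sender_rank))


a = Header(send_tick=100, sender_rank=1, payload=b"from-1")
b = Header(send_tick=100, sender_rank=0, payload=b"from-0")
c = Header(send_tick=90, sender_rank=2, payload=b"from-2")

# Two different host-level arrival orders yield the same simulated order.
first_run = deterministic_order([a, b, c])
second_run = deterministic_order([c, a, b])
```

A key of this form only gives a total order if every packet in a batch is known to have arrived, which is why it has to be combined with the synchronization barrier.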
>>>>
>>>Please read my previous comment. The most important part of the
>>>multi/pd-gem5 extension is ensuring accurate and deterministic simulation.
>>>
>>>
>>>> >
>>>> >* Amount of changes
>>>> >
>>>> >pd-gem5 introduces different modes in etherlink just to provide accurate
>>>> >timing for each component in the network subsystem (NIC, link, switch)
>>>> >as well as the capability of modeling different network topologies
>>>> >(mesh, ring, fat tree, etc). To enable a simple functionality, like what
>>>> >multi-gem5 provides, the amount of changes in gem5 can be limited to
>>>> >time-stamping packets and providing synchronization through python
>>>> >scripts. However, multi-gem5 re-implements functionalities that are
>>>> >already in gem5.
>>>>
>>>> This argument holds only if both implementations are correct
>>(robust).
>>>>It
>>>> still seems to me that pd-gem5 does not provide correctness for the
>>>> synchronization/checkpointing parts.
>>>>
>>>Again, please read my first comment on the correctness of pd-gem5.
>>>
>>>
>>>> >
>>>> >* Integrating with gem5 mainstream:
>>>> >
>>>> >The pd-gem5 launch script is written in python, which is suited for
>>>> >integration with gem5 python scripts. However, multi-gem5 uses a bash
>>>> >script. Also, all source files in pd-gem5 are already parts of gem5
>>>> >mainstream. However, multi-gem5 has tcp_server.cc/hh, which is a
>>>> >standalone process and cannot be part of gem5.
>>>>
>>>> The multi-gem5 launch script is simple enough to rely only on the shell.
>>>> It can obviously be easily re-written in python if that added any value.
>>>> The tcp_server component is only a utility (like the "m5" utility that
>>>> is also part of gem5).
>>>>
>>>The thing is that it's more likely that users want to add some
>>>functionality to the run-script of multi/pd-gem5. E.g. the pd-gem5
>>>run-script supports launching simulations using simulation pool
>>>management software (http://research.cs.wisc.edu/htcondor/). Using python
>>>enables users to easily add this kind of support.
>>>
>>>
>>>>
>>>> Cheers,
>>>> - Gabor
>>>>
>>>>
>>>> >On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham
>><***@arm.com>
>>>> >wrote:
>>>> >
>>>> >>Hello everyone,
>>>> >>We have taken a look at how pd-gem5 compares with multi-gem5.
>>While
>>>> >>intending
>>>> >>to deliver the same functionality, there are some crucial
>>differences:
>>>> >>
>>>> >>* Synchronization.
>>>> >>
>>>> >> pd-gem5 implements this in Python (not a problem in itself;
>>>> >>aesthetically
>>>> >> this is nice, but...). The issue is that pd-gem5's data
>>packets
>>>>and
>>>> >> barrier messages travel over different sockets. Since pd-gem5
>>>>could
>>>> >>see
>>>> >> data packets passing synchronization barriers, it could create
>>an
>>>> >> inconsistent checkpoint.
>>>> >>
>>>> >> multi-gem5's synchronization is implemented in C++ using sync
>>>>events,
>>>> >>but
>>>> >> more importantly, the messages queue up in the same stream and
>>so
>>>> >>cannot
>>>> >> have the issue just described. (Event ordering is often
>>crucial
>>>>in
>>>> >> snapshot protocols.) Therefore we feel that multi-gem5 is a
>>more
>>>> >>robust
>>>> >> solution in this respect.
>>>> >>
>>>> >>* Packet handling.
>>>> >>
>>>> >> pd-gem5 uses EtherTap for data packets but changed the polling
>>>> >>mechanism
>>>> >> to go through the main event queue. Since this rate is
>>actually
>>>> >>linked
>>>> >> with simulator progress, it cannot guarantee that the packets
>>are
>>>> >>serviced
>>>> >> at regular intervals of real time. This can lead to packets
>>>> >>queueing up
>>>> >> which would contribute to the synchronization issues mentioned
>>>>above.
>>>> >>
>>>> >> multi-gem5 uses plain sockets with separate receive threads
>>and so
>>>> >>does
>>>> >>not
>>>> >> have this issue.
>>>> >>
>>>> >>* Checkpoint accuracy.
>>>> >>
>>>> >> A user would like to have a checkpoint at precisely the time the
>>>> >> 'm5 checkpoint' operation is executed so as to not miss any of
>>the
>>>> >> area of interest in his application.
>>>> >>
>>>> >> pd-gem5 requires that simulation finish the current quantum
>>>> >> before checkpointing, so it cannot provide this.
>>>> >>
>>>> >> (Shortening the quantum can help, but usually the snapshot is
>>being
>>>> >>taken
>>>> >> while 'fast-forwarding', i.e. simulating as fast as possible,
>>which
>>>> >>would
>>>> >> motivate a longer quantum.)
>>>> >>
>>>> >> multi-gem5 can enter the drain cycle immediately upon receiving
>>a
>>>> >> checkpoint request. We find this accuracy highly desirable.
>>>> >>
>>>> >>* Implementation of network topology.
>>>> >>
>>>> >> pd-gem5 uses a separate gem5 process to act as a switch whereas
>>>> >>multi-gem5
>>>> >> uses a standalone packet relay process.
>>>> >>
>>>> >> We haven't measured the overhead of pd-gem5's simulated switch
>>yet,
>>>> >>but
>>>> >> we're confident that our approach is at least as fast and more
>>>> >>scalable.
>>>> >>
>>>> >>
>>>> >>Thanks,
>>>> >>Curtis
>>>> >>________________________________________
>>>> >>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad
>>>>Alian [
>>>> >>***@wisc.edu]
>>>> >>Sent: Friday, June 26, 2015 7:37 PM
>>>> >>To: gem5 Developer List
>>>> >>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>>>> >>system
>>>> >>on multiple physical hosts
>>>> >>
>>>> >>Hi Anthony,
>>>> >>
>>>> >>I think that would be a good option, then I can add pd-gem5
>>>> >>functionality on top of that. Right now I've simplified your
>>>> >>implementation. Also, I think I had found some bugs in your patch that
>>>> >>I cannot remember now. If you decide to ship the EtherSwitch patch, let
>>>> >>me know so I can give you a review on it.
>>>> >>
>>>> >>Thanks,
>>>> >>Mohammad
>>>> >>
>>>> >>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
>>>> >>***@amd.com> wrote:
>>>> >>
>>>> >>>Would it make sense for me to ship the EtherSwitch patch first,
>>since
>>>> >>it
>>>> >>>has utility on its own, and then we can decide which of the
>>>> >>"multi-gem5"
>>>> >>>approaches is best, or if it's some combination of both?
>>>> >>>
>>>> >>>The only reason I never shipped it was because Steve raised an
>>issue
>>>> >>that
>>>> >>>I didn't have a good alternative for, and didn't have the time to
>>>>look
>>>> >>into
>>>> >>>one at that time.
>>>> >>>________________________________________
>>>> >>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
>>>> >>Alian [
>>>> >>>***@wisc.edu]
>>>> >>>Sent: Wednesday, June 24, 2015 12:43 PM
>>>> >>>To: gem5 Developer List
>>>> >>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>>>> >>system
>>>> >>>on multiple physical hosts
>>>> >>>
>>>> >>>Hi Andreas,
>>>> >>>
>>>> >>>Thanks for the comment.
>>>> >>>I think the checkpointing support in both works is the same. Here
>>is
>>>> >>how
>>>> >>>checkpointing support is implemented in pd-gem5:
>>>> >>>
>>>> >>>Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
>>>> >>>instruction, it will send a "recv-ckpt" signal to the
>>>> >>>"barrier" process. Then the "barrier" process sends a "take-ckpt"
>>>> >>>signal to all the simulated nodes
>>>> >>>(including the node that encountered m5-checkpoint) at the end of the
>>>> >>>current simulation quantum. On reception of the
>>>> >>>"take-ckpt" signal, gem5 processes start dumping checkpoints. This
>>>> >>>makes each simulated node dump a checkpoint
>>>> >>>at the same simulated time point while ensuring there are no in-flight
>>>> >>>packets.
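
The barrier-side logic described above can be sketched as a small state machine. This is a simplified model, not the pd-gem5 barrier script: the class, its methods and the "continue" signal are hypothetical, and only the "recv-ckpt" and "take-ckpt" names come from the description.

```python
class BarrierCoordinator:
    """Collects end-of-quantum arrivals and decides what to broadcast."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.arrived = set()
        self.ckpt_requested = False

    def on_recv_ckpt(self, node_id):
        """A node hit the m5-checkpoint pseudo instruction mid-quantum."""
        self.ckpt_requested = True

    def on_quantum_done(self, node_id):
        """Record that a node finished its quantum.

        Returns an empty list until every node has arrived, then one
        (node_id, signal) pair per node for the barrier to send out.
        """
        self.arrived.add(node_id)
        if len(self.arrived) < self.num_nodes:
            return []
        signal = "take-ckpt" if self.ckpt_requested else "continue"
        self.arrived.clear()
        self.ckpt_requested = False
        return [(node, signal) for node in range(self.num_nodes)]
```

Deferring the broadcast until the last node arrives is what makes every node checkpoint at the same simulated tick. It is also why the checkpoint can land up to one quantum after the requesting instruction.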
>>>> >>>
>>>> >>>I believe this is the same as multi-gem5 patch approach for
>>>>checkpoint
>>>> >>>support (based on the commit message of
>>>> >>http://reviews.gem5.org/r/2865/
>>>> >>).
>>>> >>>Also, we have tested our mechanism with several benchmarks and it
>>>> >>works.
>>>> >>As
>>>> >>>Steve suggested, I'll look into Curtis's patch and try to review
>>it
>>>>as
>>>> >>>well.
>>>> >>>But as Nilay also mentioned earlier, there is some code missing in
>>>> >>>Curtis's patch. I prefer to first run multi-gem5 before starting to
>>>> >>>review it.
>>>> >>>
>>>> >>>Thank you,
>>>> >>>Mohammad
>>>> >>>
>>>> >>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
>>>> >>***@arm.com>
>>>> >>>wrote:
>>>> >>>
>>>> >>>>Hi Steve,
>>>> >>>>
>>>> >>>>Apologies for the confusion. We are on the same page. My point is
>>>> >>that
>>>> >>we
>>>> >>>>cannot simply take a little bit of patch A and a little bit of
>>>> >>patch B.
>>>> >>>>This change involves a lot of code, and we need to approach this
>>in
>>>> >>a
>>>> >>>>structured fashion. My proposal is to do it bottom up, and start
>>by
>>>> >>>>getting the basic support in place. Since
>>>> >>>http://reviews.gem5.org/r/2826/
>>>> >>>>has already been on the review board for a few months, I am merely
>>>> >>>>suggesting that it would be a good start to relate the newly posted
>>>> >>>>patches to what is already there.
>>>> >>>>
>>>> >>>>Andreas
>>>> >>>>
>>>> >>>>
>>>> >>>>
>>>> >>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
>>>> >>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
>>>> >>>>
>>>> >>>>>Hi Andreas,
>>>> >>>>>
>>>> >>>>>I'm a little confused by your email---you say you're
>>fundamentally
>>>> >>>opposed
>>>> >>>>>to looking at both patches and picking the best features, then
>>you
>>>> >>point
>>>> >>>>>out that the patches Curtis posted have the feature of better
>>>> >>>>>checkpointing
>>>> >>>>>support so we should pick that :).
>>>> >>>>>
>>>> >>>>>Obviously we can't just pick patch A from Mohammad's set and
>>patch
>>>> >>B
>>>> >>>from
>>>> >>>>>Curtis's set and expect them to work together, but I think that
>>>> >>having
>>>> >>>>>both
>>>> >>>>>sets of patches available and comparing and contrasting the two
>>>> >>>>>implementations should enable us to get to a single
>>implementation
>>>> >>>that's
>>>> >>>>>the best of both. Someone will have to make the effort of
>>>> >>integrating
>>>> >>>the
>>>> >>>>>better ideas from one set into the other set to create a new
>>>> >>unified
>>>> >>set
>>>> >>>>>of
>>>> >>>>>patches; (or maybe we commit one set and then integrate the
>>best of
>>>> >>the
>>>> >>>>>other set as patches on top of that), but the first step is to
>>>> >>identify
>>>> >>>>>what "the best of both" is. Having Mohammad look at Curtis's
>>>> >>patches,
>>>> >>>and
>>>> >>>>>Curtis (or someone else from ARM) closely examine Mohammad's
>>>> >>patches
>>>> >>>would
>>>> >>>>>be a great start. I intend to review them both, though
>>>> >>unfortunately
>>>> >>my
>>>> >>>>>time has been scarce lately---I'm hoping to squeeze that in
>>later
>>>> >>this
>>>> >>>>>week.
>>>> >>>>>
>>>> >>>>>Once we've had a few people look at both, we can discuss the
>>pros
>>>> >>and
>>>> >>>cons
>>>> >>>>>of each, then discuss the strategy for getting the best features
>>>> >>in.
>>>> >>So
>>>> >>>>>far I've heard that Mohammad's patches have a better network
>>model
>>>> >>but
>>>> >>>the
>>>> >>>>>ARM patches have better checkpointing support; that seems like a
>>>> >>good
>>>> >>>>>start.
>>>> >>>>>
>>>> >>>>>Steve
>>>> >>>>>
>>>> >>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
>>>> >>>***@arm.com
>>>> >>>>>
>>>> >>>>>wrote:
>>>> >>>>>
>>>> >>>>>>Hi all,
>>>> >>>>>>
>>>> >>>>>>Great work. However, I fundamentally do not believe in the approach
>>>> >>>>>>of 'letting reviewers pick the best features'. There is no way we
>>>> >>>>>>would ever get something working out of it. We need to get _one_
>>>> >>>>>>working solution here, and figure out how to best get there. I would
>>>> >>>>>>propose to do it bottom up, starting with the basic multi-simulator
>>>> >>>>>>instance support, checkpointing support, and then move on to the
>>>> >>>>>>network between the simulator instances.
>>>> >>>>>>
>>>> >>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
>>>> >>>>>>support from what Curtis has posted. I believe proper checkpointing
>>>> >>>>>>support to be the most challenging, and from what I can tell this is
>>>> >>>>>>far more limited in what you just posted, Mohammad. Could you perhaps
>>>> >>>>>>review Curtis's patches based on your insights, so we can try to get
>>>> >>>>>>these patches in shape and committed asap?
>>>> >>>>>>
>>>> >>>>>>Once we have the baseline functionality in place, then we can
>>>> >>start
>>>> >>>>>>looking at the more elaborate network models.
>>>> >>>>>>
>>>> >>>>>>Does this sound reasonable?
>>>> >>>>>>
>>>> >>>>>>Thanks,
>>>> >>>>>>
>>>> >>>>>>Andreas
>>>> >>>>>>

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782
Steve Reinhardt
2015-07-03 18:33:51 UTC
Permalink
Thanks Mohammad & Gabor for the responses.

I think there's still some misunderstanding on what I mean by the
integration of multi-threaded and multi-host simulation based on Gabor's
response above and Andreas's response in the other thread.

The primary example scenario I'm proposing is as Mohammad described: within
each host node, we're simulating an entire rack + top-of-rack switch in a
single gem5 process, with separate event queues/threads being used to
parallelize across nodes within the rack. The switch may or may not be on
its own thread as well. The synchronization among the threads only needs
to be at the granularity of the intra-rack network latency.

Now we want to expand this by using pd-gem5 or multi-gem5 to parallelize
multiple of these rack-level simulations across hosts, so we can simulate a
whole row of a datacenter. Only the uplinks from the TOR switches would
need to go over sockets between processes, and the switch being modeled by
pd-gem5 or multi-gem5 would be the end-of-row switch. The synchronization
delay among the multiple gem5 processes would be based on the inter-rack
latency.
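
A minimal sketch of the two-level synchronization described above, in Python. The function name, the latencies, and the tick values are hypothetical and are not taken from either patch. The only point is that threads within a rack synchronize at the intra-rack latency while gem5 processes synchronize at the inter-rack latency, so the barriers that cross hosts are much less frequent.

```python
def barrier_counts(sim_ticks, intra_rack_latency, inter_rack_latency):
    """Count sync barriers at each level over sim_ticks of simulated time.

    Hypothetical helper: each level uses its link latency as its quantum,
    i.e. the longest interval between barriers that still lets every packet
    be delivered before its receive tick.
    """
    if intra_rack_latency <= 0 or inter_rack_latency <= 0:
        raise ValueError("latencies must be positive")
    # Barriers among threads inside one gem5 process (one rack).
    thread_barriers = sim_ticks // intra_rack_latency
    # Barriers among gem5 processes on different hosts (over sockets).
    process_barriers = sim_ticks // inter_rack_latency
    return thread_barriers, process_barriers


# Illustrative numbers only, assuming 1 tick = 1 ps:
# 1 ms simulated, 1 us intra-rack latency, 10 us inter-rack latency.
threads, processes = barrier_counts(10**9, 10**6, 10**7)
print(threads, processes)
```

With these made-up latencies the cross-host barrier runs ten times less often than the in-process one, which is why only the TOR uplinks would need to pay the socket cost.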

So the basic question is: Is this feasible with pd-gem5 / multi-gem5, and
if not, how much work would it take to make it so?

However, my larger point is that I still don't see value in ever building a
shared-memory transport for MultiIface. For this model, there is clearly no
need for it. Things get more complicated if we want to do something like
have N nodes connected to a single switch and split that over two hosts
(with N/2 nodes simulated on each), but even in that case, I think it's a
better idea to make the switch model deal with having half of its links
internal and half external (since we already want the same model to work in
both the all-internal and all-external cases). Not that I'm worried that
someone is about to go off and build this shared-memory transport, but I
think it's important to reach an understanding here, since it's fundamental
to defining the strategic relationship between these capabilities going
forward.

Stepping back a little further, it would be nice to have a model that is as
generic as the multi-threading model, where it's really just a matter of
taking a simulation, partitioning the components among the threads, and
setting the synchronization quantum, and it works. Of course, even with the
multi-threaded model, if you don't choose your partitioning and your
quantum wisely, you're not going to get much speedup or a deterministic
simulation, but the fundamental implementation is oblivious to that. I'm
not saying we really need to go all the way to this extreme---it's pretty
reasonable to assume that no one in the near future will want to partition
across hosts anywhere other than on a simulated network link---but I think
we should keep this ideal in mind as a guiding principle as we choose how
to go forward from here.

This ties in to my point #4, which is that if we're really building a
mechanism to partition a simulation across multiple hosts, then you should
be able to run the same simulation in a single gem5 process and get the
same results. I think this is the strength of pd-gem5; correspondingly the
main weakness of multi-gem5 is that it architecturally feels more like
tying together a set of mostly independent gem5 simulations than like
partitioning a single gem5 simulation. (Of course, they both end up at
roughly the same point in the middle.)

On the flip side, multi-gem5 has some clear advantages in terms of the
better separation of the communication layer (and I can imagine it being
very useful to port to MPI and perhaps some RDMA API for InfiniBand
clusters). Also I think the integrated sockets for communication and
synchronization are the superior design; while the separate sockets used by
pd-gem5 may only very rarely cause problems, I agree with Andreas that
that's not good enough, and I don't see any real advantage either---if you
have to flush the data sockets (or wait for them to drain) before
synchronizing, then you might as well just have the synchronization
messages queue up behind the data messages.
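
The ordering argument above can be illustrated with a toy model, sketched here in Python. This is not code from either patch: the message tuples and the drain_until_barrier() helper are hypothetical, and a deque stands in for a TCP stream. It only shows that a sync message sharing a FIFO stream with the data cannot be seen before the data sent ahead of it, whereas a sync message on its own stream can.

```python
from collections import deque


def drain_until_barrier(stream):
    """Pop messages from one FIFO stream until the sync marker is seen.

    Returns the data payloads that were ahead of the sync marker.
    """
    received = []
    while stream:
        kind, payload = stream.popleft()
        if kind == "sync":
            return received
        received.append(payload)
    raise RuntimeError("stream ended before a sync message arrived")


# One shared stream: both packets are received before the barrier completes.
shared = deque([("data", "pkt0"), ("data", "pkt1"), ("sync", None)])
print(drain_until_barrier(shared))

# Separate streams: the barrier completes while a packet is still unread
# in the data stream, so the receiver has to drain that stream explicitly.
data_stream = deque([("data", "pkt0")])
sync_stream = deque([("sync", None)])
print(drain_until_barrier(sync_stream), len(data_stream))
```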

Regarding unsynchronized checkpoints: Thanks for the example, but I'm still
a little confused. If all the processes are about to execute an
MPI_Barrier(), doesn't that mean they'll all be synchronized shortly
anyway? So what's the harm in waiting until they're synchronized and
then checkpointing?

Regarding the simulation of non-Ethernet networks: I agree that the biggest
obstacle to this is the lack of generality of the current gem5 network
components. I tried to take a step toward supporting other link types two
years ago (see http://reviews.gem5.org/r/1922) but someone shot me down ;).
We shouldn't try and fix that here, but we should also consciously try not
to make it any worse...

Thanks for reading all the way to the end!

Steve


On Fri, Jul 3, 2015 at 7:11 AM Gabor Dozsa <***@arm.com> wrote:

> Hi all,
>
> Thank you Steve for the thorough review.
>
> First, let me elaborate a bit on Andreas’s 3rd point about non-synchronous
> checkpoints. Let’s assume that we aim to simulate MPI applications (HPC
> workloads). The ROI in an MPI application typically starts with a
> global MPI_Barrier() call. We want to take the checkpoint when *every*
> gem5 process has reached that MPI_Barrier() in the simulated code, but that
> may not happen at the same tick in each gem5 (due to load imbalance among
> the simulated nodes). That’s why multi-gem5 implements the non-synchronous
> checkpoint support.
>
> My answers to your questions are as follows.
>
> 1. The only change necessary to use multi-gem5 with a non-Ethernet
> (simulated) network is to replace the Ethernet packet type with another
> packet type in MultiIface.
> In fact, the first implementation of MultiIface was a template
> that took EthPacketData as parameter because I planned to support different
> network types. When I realized that currently only Ethernet is supported
> by gem5 I dropped the template param to keep the implementation simpler. I
> have also realized in the meantime that the right approach would probably
> be to create a pure virtual 'base' class for network packets from which
> Ethernet (and other types of) packets could be derived. Then MultiIface
> could simply use that base class to provide support for different network
> types. The interface provided by the base packet class could be very
> simple. Besides the total size() of the packet, multi-gem5 only needs a
> method to 'extract' the source/destination address. Those addresses are
> used in MultiIface as opaque byte arrays so they are quite network type
> agnostic already.
>
> 2. That’s right, we have designed the MultiIface/TCPIface split with
> different underlying messaging systems in mind.
>
> 3. Multi-gem5 can work together with multi-threaded/multi-event-queue gem5
> configs. The current TCPIface/tcp_server components would still use
> sockets to send around the packets. So it is possible to put together a
> multi-gem5 simulation where each gem5 process has multiple event queues
> (and an independent simulation thread per event queue) but all the
> simulated Ethernet links would use sockets to forward every Ethernet
> packet to the tcp_server.
>
> If someone wanted to run only a single gem5 process to simulate an entire
> cluster (using one thread/event-queue per cluster node) then the current
> multi-gem5 implementation using sockets/tcp_server is not optimal. In that
> case, a better solution would be to provide a shared memory based
> implementation of the MultiIface virtual communication methods
> sendRaw()/recvRaw()/syncRaw() (i.e. a shared memory equivalent of
> TCPIface). In that implementation, the entire discrete tcp_server component
> could be replaced with a shared data structure.
>
> 4. You are right, the current implementation does not make it possible to
> construct an equivalent single-process simulation model for a multi-gem5
> run. However, a possible solution is a shared memory based implementation
> of the MultiIface virtual communication methods just as I described in the
> previous paragraph. The same implementation could then work with both
> multi-threaded/multi-event-queues and single-thread/single-event-queue
> gem5 configs.
>
> Thanks,
> - Gabor
>
> On 7/2/15, 7:20 PM, "Steve Reinhardt" <***@gmail.com> wrote:
>
> >Hi everyone,
> >
> >Sorry for taking so long to engage. This is a great development and I
> >think both these patches are terrific contributions. Thanks to Mohammad,
> >Gabor, and everyone else involved.
> >
> >I agree with Andreas that we should start with some top-level goals &
> >assumptions, agree on those, and then we can sort out the detailed issues
> >based on a consistent view.
> >
> >I definitely agree with Andreas's first two points. The third one seems a
> >little surprising; I'd like to hear more about the motivation before
> >expressing an opinion. I can see where non-synchronous checkpointing could
> >be useful, but it's also clear from the associated patch that it's not
> >trivial to implement either. How much would be lost by requiring a
> >synchronization before a checkpoint?
> >
> >From my personal perspective, I would like to see whatever we do here be a
> >first step toward a more general distributed simulation platform. Both of
> >these patches seem pretty Ethernet-centric in different ways. This is not
> >terrible; part of the problem is that gem5's current internal networking
> >support is already overly Ethernet-centric IMO. But it would be nice to
> >avoid baking that in even further. Rather than assume I have understood
> >all the code completely, I'll phrase things in the form of questions, and
> >people can comment on how those questions would be answered in the context
> >of the two different approaches.
> >
> >1. How much effort would be required to simulate a non-Ethernet network?
> >My impression is that pd-gem5 has a leg up here, since a gem5 switch model
> >for a non-Ethernet network (which you'd have to write anyway if you were
> >simulating a different network) could be used in place of the current
> >Ethernet switch, where for multi-gem5 I think that the
> >util/multi/tcp_server.cc code would have to be modified (i.e., there'd be
> >additional work above and beyond what you'd need to get the network
> >modeled in base gem5).
> >
> >2. How much effort is required to run on a non-Ethernet network (or
> >equivalently using a non-sockets API)? The MultiIface/TCPIface split in
> >the multi-gem5 code looks like it addresses this nicely, but pd-gem5 seems
> >pretty tied to an Ethernet host fabric.
> >
> >3. Do both of these patches work with the existing multithreaded
> >multiple-event-queue simulation? I think multi-gem5 does (though it would
> >be nice to have a confirmation), but it's not clear about pd-gem5. I don't
> >see a benefit to having multiple gem5 processes on a single host vs. a
> >single multithreaded gem5 process using the existing support. I think this
> >could be particularly valuable with a hierarchical network; e.g., maybe I
> >would want to model a rack in multithreaded mode on a single multicore
> >server, then use pd-gem5 or multi-gem5 to build up a simulation of
> >multiple racks. Would this work out of the box with either of these
> >patches, and if not, what would need to be done?
> >
> >4. Is it possible to construct a single-process simulation model that's
> >identical to the distributed simulation? It would be very valuable for
> >verification to be able to take a single simulation run and do it both
> >within a single process and also across multiple processes and verify that
> >identical results are achieved. This seems like a big drawback to the
> >multi-gem5 tcp_server approach, IMO.
> >
> >I'm definitely not saying that all these issues need to be resolved before
> >anything gets committed, but if we can agree that these are valid goals,
> >then we can evaluate detailed issues based on whether they move us toward
> >or away from those goals.
> >
> >Thanks,
> >
> >Steve
> >
> >
> >On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson <***@arm.com>
> >wrote:
> >
> >>Hi all,
> >>
> >>I think we need to up-level this a bit. From our perspective (and I
> >>suspect in general):
> >>
> >>1. Robustness is important. Having a design that _may_ break, however
> >>unlikely, is simply not an option.
> >>
> >>2. Performance and scaling are important. We can compare actual numbers
> >>here, and I am fairly sure the two solutions are on par. Let’s quantify
> >>that though.
> >>
> >>3. Checkpointing must not rely on synchronicity. It is vital for several
> >>workloads that we can checkpoint the various gem5 instances at different
> >>Ticks (due to the way the workloads are constructed).
> >>
> >>Andreas
> >>
> >>On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
> >><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >>
> >>>Thanks Gabor for the reply.
> >>>
> >>>I feel this conversation is useful as we can find out pros/cons of each
> >>>design.
> >>>Please find my response in-lined below.
> >>>
> >>>Thank you,
> >>>Mohammad
> >>>
> >>>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> Sorry for the missing indentation in my previous e-mail! (This was my
> >>>> first e-mail to the dev-list so I could not simply use "reply"). Below
> >>>> is the same message, hopefully in more readable form.
> >>>>
> >>>> ====================================
> >>>>
> >>>> Hi All,
> >>>>
> >>>> Thank you Mohammad for your elaboration on the issues!
> >>>>
> >>>> I have written most of the multi-gem5 patch so let me add some more
> >>>> clarifications and answer your concerns. My comments are inline
> >>>> below.
> >>>>
> >>>> Thanks,
> >>>> - Gabor
> >>>>
> >>>> On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
> >>>>
> >>>> >Hi All,
> >>>> >
> >>>> >Curtis, thank you for listing some of the differences. I was waiting
> >>>> >for the completed multi-gem5 patch before sending my review. Please
> >>>> >see my inline response below. I've addressed the concerns that you've
> >>>> >raised. Also, I've added a bit more to the comparison.
> >>>> >
> >>>> >* Synchronization.
> >>>> >
> >>>> >pd-gem5 implements this in Python (not a problem in itself;
> >>>> >aesthetically this is nice, but...). The issue is that pd-gem5's data
> >>>> >packets and barrier messages travel over different sockets. Since
> >>>> >pd-gem5 could see data packets passing synchronization barriers, it
> >>>> >could create an inconsistent checkpoint.
> >>>> >
> >>>> >multi-gem5's synchronization is implemented in C++ using sync events,
> >>>> >but more importantly, the messages queue up in the same stream and so
> >>>> >cannot have the issue just described. (Event ordering is often crucial
> >>>> >in snapshot protocols.) Therefore we feel that multi-gem5 is a more
> >>>> >robust solution in this respect.
> >>>> >
> >>>> >Each packet in pd-gem5 has a time-stamp. So even if data packets pass
> >>>> >synchronization barriers (in other words, data packets arrive early at
> >>>> >the destination node), the destination node processes packets based on
> >>>> >their timestamp. Actually, allowing data packets to pass sync barriers
> >>>> >is a nice feature that can reduce the likelihood of late packet
> >>>> >reception. Ordering of data messages that flow over pd-gem5 nodes is
> >>>> >also preserved in the pd-gem5 implementation.
> >>>>
> >>>> This seems to be a misunderstanding. Maybe the wording was not precise
> >>>> before. The problem is not a data packet "passing" a sync barrier
> >>>> but the other way around, a sync barrier that can pass a data packet
> >>>> (e.g. while the data packet is waiting in the host operating system
> >>>> socket layer). If that happens, the packet will arrive later than it
> >>>> was supposed to and it may miss the computed receive tick.
> >>>>
> >>>> For instance, let’s assume that the quantum coincides with the
> >>>> simulated Ether link delay. (This is the optimal choice of quantum to
> >>>> minimize the number of sync barriers.) If a data packet is sent right
> >>>> at the beginning of a quantum then this packet must arrive at the
> >>>> destination gem5 process within the same quantum in order not to miss
> >>>> its receive tick at the very beginning of the next quantum. If the
> >>>> sync barrier can pass the data packet then the data packet may arrive
> >>>> only during the next quantum (or in extreme conditions even later than
> >>>> that) so when it arrives the receiver gem5 may have already passed the
> >>>> receive tick.
> >>>>
> >>>This argument makes more sense than the previous one. Note that gem5 is
> >>>a cycle accurate simulator and it runs orders of magnitude slower than
> >>>real hardware. So it's almost impossible that the flight time of a packet
> >>>through the real network turns out to be more than the simulation time of
> >>>one quantum. We ran a set of experiments just for this purpose: with
> >>>quantum size equal to etherlink delay, we never got any late arrival
> >>>violation (what you described) for the full NAS benchmark suite (please
> >>>refer to the paper).
> >>>
> >>>multi-gem5 is optimized for a case that almost never happens, and is
> >>>sacrificing speedup for no gain.
> >>>
> >>>
> >>>> Time-stamping does help with this issue. Also, if a data packet is
> >>>> waiting in the host operating system socket layer when the simulation
> >>>> thread exits to python to complete the next sync barrier then the
> >>>> packet will not go into the checkpoint that may follow that sync
> >>>> barrier.
> >>>>
> >>>That's a good point. The current pd-gem5 checkpointing mechanism might
> >>>miss packets that have been sent during the previous quantum and are
> >>>waiting in the OS socket buffer. I should add some code inside the
> >>>ethertap serialization function to drain the ethertap socket before
> >>>writing a checkpoint. I will update the pd-gem5 patch accordingly.
> >>>
> >>>>
> >>>> >What you mentioned as an advantage for multi-gem5 is actually a key
> >>>> >disadvantage: buffering sync messages behind data packets can add to
> >>>> >the synchronization overhead and slow down simulation significantly.
> >>>>
> >>>> The purpose of sync messages is to make sure that the data packets
> >>>> arrive in time (in terms of simulated time) at the destination so they
> >>>> can be scheduled for being received at the proper computed tick. Sync
> >>>> messages also make sure that no data packets are in flight when a sync
> >>>> barrier completes before we take a checkpoint. They definitely add
> >>>> overhead for the simulation but they are necessary for the correctness
> >>>> of the simulation.
> >>>>
> >>>> The receive thread in multi-gem5 reads out packets from the socket in
> >>>> parallel with the simulation thread so packets normally will not be
> >>>> "queueing up" before a sync barrier message. There is definitely room
> >>>> for improvements in the current implementation for reducing the
> >>>> synchronization overhead but that is likely true for pd-gem5, too.
> >>>> The important thing here is that the solution must provide correctness
> >>>> (robustness) first.
> >>>>
> >>>pd-gem5 provides correctness. Please read my previous comment. The whole
> >>>purpose of multi/pd-gem5 is to parallelize simulation with minimal
> >>>overhead and gain speedup. If you fail to do so, nobody will use your
> >>>tool.
> >>>
> >>>
> >>>> >Also, multi-gem5 sends huge messages (multiHeaderPkt) through the
> >>>> >network to perform each synchronization point, which increases
> >>>> >synchronization overhead further. In pd-gem5, we choose to send just
> >>>> >one character as the sync message through a separate socket to reduce
> >>>> >synchronization overhead.
> >>>>
> >>>> The TCP/IP message size is unlikely to be the bottleneck here.
> >>>> Multi-gem5 will send ~50 bytes more in a sync barrier message than
> >>>> pd-gem5 but that bigger sync message still fits into a single ethernet
> >>>> frame on the wire. The end-to-end latency overhead that is caused by
> >>>> 50 bytes extra payload for a small single frame TCP/IP message is
> >>>> likely to fall into the "noise" category if one tries to measure it
> >>>> in a real cluster.
> >>>>
> >>>You should prove your hypothesis experimentally. Each gem5 process
> >>>sends/receives sync messages at the end of every quantum. Say you are
> >>>simulating an "N" node computer cluster with "M" different
> >>>configurations. Then you will have N*M gem5 processes that send/receive
> >>>these 50 bytes (I think it's more) of extra data at the same time over
> >>>the network ...
> >>>
> >>>Furthermore, multi-gem5 sends a header before each data message. By
> >>>comparison, pd-gem5 just adds 12 bytes (each time-stamp is the 12 least
> >>>significant digits of the Tick) to each data packet. I don't know exactly
> >>>how large these "MultiHeaderPkt" are, but each has two Tick fields that
> >>>are 64 Bytes each! Also, header packets are separate TCP packets, so you
> >>>pay for sending two separate packets for each data packet. And worst, you
> >>>serialize all of these with sync messages.
> >>>
> >>>
> >>>> >
> >>>> >* Packet handling.
> >>>> >
> >>>> >pd-gem5 uses EtherTap for data packets but changed the polling
> >>>> >mechanism to go through the main event queue. Since this rate is
> >>>> >actually linked with simulator progress, it cannot guarantee that the
> >>>> >packets are serviced at regular intervals of real time. This can lead
> >>>> >to packets queueing up which would contribute to the synchronization
> >>>> >issues mentioned above.
> >>>> >
> >>>> >multi-gem5 uses plain sockets with separate receive threads and so
> >>>> >does not have this issue.
> >>>> >
> >>>> >I think again you are pointing to your first concern that I've
> >>>> >explained above. Packets that have queued up in the EtherTap socket
> >>>> >will be processed and delivered to the simulation environment at the
> >>>> >beginning of the next simulation quantum.
> >>>> >
> >>>> >Please notice that multi-gem5 introduces new simObjects to interface
> >>>> >the simulation environment to the real world, which is redundant. This
> >>>> >functionality is already provided by EtherTap.
> >>>>
> >>>> Except that the EtherTap solution does not provide a correct (robust)
> >>>> solution for the synchronization problem.
> >>>>
> >>>Please read my first/second comments.
> >>>
> >>>
> >>>> >
> >>>> >* Checkpoint accuracy.
> >>>> >
> >>>> >A user would like to have a checkpoint at precisely the time the
> >>>> >'m5 checkpoint' operation is executed so as to not miss any of the
> >>>> >area of interest in his application.
> >>>> >
> >>>> >pd-gem5 requires that simulation finish the current quantum
> >>>> >before checkpointing, so it cannot provide this.
> >>>> >
> >>>> >(Shortening the quantum can help, but usually the snapshot is being
> >>>> >taken while 'fast-forwarding', i.e. simulating as fast as possible,
> >>>> >which would motivate a longer quantum.)
> >>>> >
> >>>> >multi-gem5 can enter the drain cycle immediately upon receiving a
> >>>> >checkpoint request. We find this accuracy highly desirable.
> >>>> >
> >>>> >It¹s true that if you have a large quantum size then there would be
> >>>>some
> >>>> >discrepancy between the m5_ckpt instruction tick and the actual dump
> >>>>tick.
> >>>> >Based on multi-gem5 code, my understanding is that you send async
> >>>> >checkpoint message as soon as one of the gem5 processes encounter
> >>>>m5_ckpt
> >>>> >instruction. But I'm not sure how you fix the aforementioned issue,
> >>>> >because
> >>>> >you have to sync all gem5 processes before you start dumping
> >>>>checkpoint,
> >>>> >which necessitate a global synchronization beforehand.
> >>>>
> >>>> In multi-gem5, the gem5 process who encounters the m5_ckpt
> >>instruction
> >>>> sends out an async checkpoint notification for the peer gem5
> >>processes
> >>>>and
> >>>> then it starts the draining immediately (at the same tick). So the
> >>>> checkpoint will be taken at the exact tick from the initiator process
> >>>> point of view. The global synchronisation with the peer processes
> >>takes
> >>>> place while the initiator process is still waiting at the same tick
> >>(i.e
> >>>> the simulation thread is suspended). However, the receiver thread
> >>>> continues reading out the socket - while waiting for the global sync
> >>to
> >>>> complete- to make sure that in-flight data packets from peer gem5
> >>>>processes
> >>>> are stored properly and saved into the checkpoint.
> >>>>
> >>>>
> >>>So you mean multi-gem5 ends up with having gem5 processes with
> >>different
> >>>ticks after checkpoint? In pd-gem5 we make sure that all gem5 processes
> >>>start dumping checkpoint at the same tick. Are you sure that this is
> >>>correct to have each gem5 process dump checkpoint at different ticks???
> >>>
> >>>I don't think this a correct checkpointing design. However, if you
> >>feel it
> >>>is correct, I can change a couple of lines in "Simulation.py" and
> >>barrier
> >>>scripts to implement the same functionality in pd-gem5. One thing that
> >>you
> >>>are obsessed about is to make sure that there is no in-flight packets
> >>>while
> >>>we start dumping checkpoint, and you have all these complex mechanisms
> >>in
> >>>place to ensure that! I think you can 99.99999% make sure that there
> >>is no
> >>>in-flight packet by waiting for 1 second after all gem5 processes
> >>finished
> >>>their quantum simulation and then dump checkpoint. Do you really think
> >>>that
> >>>delivering a tcp packet would take more than 1 second in today's
> >>systems!?
> >>>Always go for simple solutions ...
> >>>
> >>>
> >>>
> >>>> >
> >>>> >By the way, we have a fix for this issue by introducing a new m5
> >>pseudo
> >>>> >instruction.
> >>>>
> >>>> I fail to see how a new pseudo instruction can solve the problem of
> >>>> completing the full quantum in pd-gem5 before a checkpoint can be
> >>taken.
> >>>> Could you please elaborate on that?
> >>>>
> >>>> As we take checkpoint while fast-forwarding and it is likely that we
> >>>>relax
> >>>synchronization for speedup purpose, a new pseudo instruction that can
> >>set
> >>>quantum size (m5_qset) can be helpful. So, one can insert m5_qset in
> >>his
> >>>benchmark source code before entering ROI that contains m5_ckpt to
> >>>decrease
> >>>quantum size beforehand and reduce the discrepancy between m5_ckpt tick
> >>>and
> >>>actual checkpoint tick. This is not included in pd-gem5 patch right
> >>now.
> >>>
> >>>
> >>>> >
> >>>> >* Implementation of network topology.
> >>>> >
> >>>> >pd-gem5 uses a separate gem5 process to act as a switch whereas
> >>>>multi-gem5
> >>>> >
> >>>> >uses a standalone packet relay process.
> >>>> >
> >>>> >We haven't measured the overhead of pd-gem5's simulated switch yet,
> >>but
> >>>> >
> >>>> >we're confident that our approach is at least as fast and more
> >>>>scalable.
> >>>> >
> >>>> >There is this flexibility in pd-gem5 to simulate a switch box
> >>alongside
> >>>> >one
> >>>> >of the other gem5 processes. However, it might make that gem5
> >>process
> >>>>the
> >>>> >simulation bottleneck. One of the advantages of pd-gem5 over
> >>>>multi-gem5 is
> >>>> >that we use gem5 to simulate a switch box, which allows us to model
> >>any
> >>>> >network topology by instantiating several Switch simObjects and
> >>>> >interconnect them with EtherLink in an arbitrary fashion. A
> >>standalone
> >>>>tcp
> >>>> >server just can provide switch functionality (forwarding packets to
> >>>> >destinations) and model a star network topology. Furthermore, it
> >>cannot
> >>>> >model various network timings such as queueing delay, congestion,
> >>and
> >>>> >routing latency. Also it has some accuracy issues that I will point
> >>out
> >>>> >next.
> >>>>
> >>>> I agree with the complex topology argument. We already mentioned that
> >>>> before as an advantage for pd-gem5 from the point of view of future
> >>>> extensions. However, I do not agree that multi-gem5 cannot model
> >>>>queueing
> >>>> delays and congestions. For a simple crossbar switch, it can model
> >>>>queueing
> >>>> delays and congestions, but the receive queues are distributed among
> >>the
> >>>> gem5 processes.
> >>>>
> >>>> It's true that you can model queuing delay of a simple crossbar by
> >>>distributing queues across gem5 processes (end points). But to be able
> >>to
> >>>do so you have to ensure the ordering of packets that you enqueue in
> >>the
> >>>distributed queues. It is almost impossible without a synchronized
> >>switch
> >>>box. You should have a reorder queue that reorders packets dynamically
> >>and
> >>>updates timing parameter for each packet as well. I don't know how much
> >>>progress have you had to ensure ordering scheme in multi-gem5 but you
> >>may
> >>>already realized that how complex and error prone it can be. This
> >>argument
> >>>is also related to my next argument for "Broken network timing".
> >>>
> >>>
> >>>> >
> >>>> >* Broken network timing:
> >>>> >
> >>>> >Forwarding packets between gem5 processes using a standalone tcp
> >>server
> >>>> >can
> >>>> >cause reordering between packets that have different source but same
> >>>> >destination. It causes inaccurate network timing and worse of all
> >>>> >non-deterministic simulation. pd-gem5 resolve this by reordering
> >>>>packets
> >>>> >at
> >>>> >Switch process and then send them to their destination (it's
> >>possible
> >>>>as
> >>>> >switch is synchronized with the rest of the nodes).
> >>>>
> >>>> In multi-gem5, there is always a HeaderPkt that contains some meta
> >>>> information for each data packet. The meta information include the
> >>send
> >>>> tick and the sender rank (i.e. a unique ID of the sender gem5
> >>process).
> >>>> We use those information to define a well defined ordering of packets
> >>>>even
> >>>> if packets are arriving at the same receiver from different senders.
> >>>>This
> >>>> packet ordering scheme is still being tested so the corresponding
> >>patch
> >>>>is
> >>>> not on the RB yet.
> >>>>
> >>>> Please read my previous comment. The most important part of
> >>>>multi/pd-gem5
> >>>extension is ensuring accurate and deterministic simulation.
> >>>
> >>>
> >>>> >
> >>>> >* Amount of changes
> >>>> >
> >>>> >pd-gem5 introduce different modes in etherlink just to provide
> >>accurate
> >>>> >timing for each component in the network subsystem (NIC, link,
> >>switch)
> >>>>as
> >>>> >well as capability of modeling different network topologies (mesh,
> >>>>ring,
> >>>> >fat tree, etc). To enable a simple functionality, like what
> >>multi-gem5
> >>>> >provides, the amount of changes in gem5 can be limited to
> >>time-stamping
> >>>> >packets and providing synchronization through python scripts.
> >>However,
> >>>> >multi-gem5 re-implements functionalities that are already in gem5.
> >>>>
> >>>> This argument holds only if both implementations are correct
> >>(robust).
> >>>>It
> >>>> still seems to me that pd-gem5 does not provide correctness for the
> >>>> synchronization/checkpointing parts.
> >>>>
> >>>> Again, please read my first comment for correctness of pd-gem5.
> >>>
> >>>
> >>>> >
> >>>> >* Integrating with gem5 mainstream:
> >>>> >
> >>>> >pd-gem5 launch script is written in python which is suited for
> >>>>integration
> >>>> >with gem5 python scripts. However multi-gem5 uses bash script. Also,
> >>>>all
> >>>> >source files in pd-gem5 are already parts of gem5 mainstream.
> >>However
> >>>> >multi-gem5 has tcp_server.cc/hh that is a standalone process and
> >>cannot
> >>>> be
> >>>> >part of gem5.
> >>>>
> >>>> The multi-gem5 launch script is simple enough to rely only on the
> >>>>shell. It
> >>>> can obviously be easily re-written in python if that added any value.
> >>>>The
> >>>> tcp_server component is only a utility (like the "m5" utility that is
> >>>>also
> >>>> part of gem5).
> >>>>
> >>>> The thing is that it's more likely that users want to add some
> >>>functionality to the run-script of multi/pd-gem5. E.g. pd-gem5
> >>run-script
> >>>supports launching simulations using a simulation pool management
> >>>software (
> >>>http://research.cs.wisc.edu/htcondor/). Using python enables users to
> >>>easily add these kind of supports.
> >>>
> >>>
> >>>>
> >>>> Cheers,
> >>>> - Gabor
> >>>>
> >>>>
> >>>> >On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham
> >><***@arm.com>
> >>>> >wrote:
> >>>> >
> >>>> >>Hello everyone,
> >>>> >>We have taken a look at how pd-gem5 compares with multi-gem5.
> >>While
> >>>> >>intending
> >>>> >>to deliver the same functionality, there are some crucial
> >>differences:
> >>>> >>
> >>>> >>* Synchronization.
> >>>> >>
> >>>> >> pd-gem5 implements this in Python (not a problem in itself;
> >>>> >>aesthetically
> >>>> >> this is nice, but...). The issue is that pd-gem5's data
> >>packets
> >>>>and
> >>>> >> barrier messages travel over different sockets. Since pd-gem5
> >>>>could
> >>>> >>see
> >>>> >> data packets passing synchronization barriers, it could create
> >>an
> >>>> >> inconsistent checkpoint.
> >>>> >>
> >>>> >> multi-gem5's synchronization is implemented in C++ using sync
> >>>>events,
> >>>> >>but
> >>>> >> more importantly, the messages queue up in the same stream and
> >>so
> >>>> >>cannot
> >>>> >> have the issue just described. (Event ordering is often
> >>crucial
> >>>>in
> >>>> >> snapshot protocols.) Therefore we feel that multi-gem5 is a
> >>more
> >>>> >>robust
> >>>> >> solution in this respect.
> >>>> >>
> >>>> >>* Packet handling.
> >>>> >>
> >>>> >> pd-gem5 uses EtherTap for data packets but changed the polling
> >>>> >>mechanism
> >>>> >> to go through the main event queue. Since this rate is
> >>actually
> >>>> >>linked
> >>>> >> with simulator progress, it cannot guarantee that the packets
> >>are
> >>>> >>serviced
> >>>> >> at regular intervals of real time. This can lead to packets
> >>>> >>queueing up
> >>>> >> which would contribute to the synchronization issues mentioned
> >>>>above.
> >>>> >>
> >>>> >> multi-gem5 uses plain sockets with separate receive threads
> >>and so
> >>>> >>does
> >>>> >>not
> >>>> >> have this issue.
> >>>> >>
> >>>> >>* Checkpoint accuracy.
> >>>> >>
> >>>> >> A user would like to have a checkpoint at precisely the time the
> >>>> >> 'm5 checkpoint' operation is executed so as to not miss any of
> >>the
> >>>> >> area of interest in his application.
> >>>> >>
> >>>> >> pd-gem5 requires that simulation finish the current quantum
> >>>> >> before checkpointing, so it cannot provide this.
> >>>> >>
> >>>> >> (Shortening the quantum can help, but usually the snapshot is
> >>being
> >>>> >>taken
> >>>> >> while 'fast-forwarding', i.e. simulating as fast as possible,
> >>which
> >>>> >>would
> >>>> >> motivate a longer quantum.)
> >>>> >>
> >>>> >> multi-gem5 can enter the drain cycle immediately upon receiving
> >>a
> >>>> >> checkpoint request. We find this accuracy highly desirable.
> >>>> >>
> >>>> >>* Implementation of network topology.
> >>>> >>
> >>>> >> pd-gem5 uses a separate gem5 process to act as a switch whereas
> >>>> >>multi-gem5
> >>>> >> uses a standalone packet relay process.
> >>>> >>
> >>>> >> We haven't measured the overhead of pd-gem5's simulated switch
> >>yet,
> >>>> >>but
> >>>> >> we're confident that our approach is at least as fast and more
> >>>> >>scalable.
> >>>> >>
> >>>> >>
> >>>> >>Thanks,
> >>>> >>Curtis
> >>>> >>________________________________________
> >>>> >>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad
> >>>>Alian [
> >>>> >>***@wisc.edu]
> >>>> >>Sent: Friday, June 26, 2015 7:37 PM
> >>>> >>To: gem5 Developer List
> >>>> >>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> >>>> >>system
> >>>> >>on multiple physical hosts
> >>>> >>
> >>>> >>Hi Anthony,
> >>>> >>
> >>>> >>I think that would be a good option, then I can add pd-gem5
> >>>> >>functionality
> >>>> >>on top of that. Right now I've simplified your implementation.
> >>Also, I
> >>>> >>think I had found some bugs in your patch that I cannot remember
> >>now.
> >>>>If
> >>>> >>you decided to ship EtherSwitch patch, let me know to give you a
> >>>>review
> >>>> >>on
> >>>> >>that.
> >>>> >>
> >>>> >>Thanks,
> >>>> >>Mohammad
> >>>> >>
> >>>> >>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
> >>>> >>***@amd.com> wrote:
> >>>> >>
> >>>> >>>Would it make sense for me to ship the EtherSwitch patch first,
> >>since
> >>>> >>it
> >>>> >>>has utility on its own, and then we can decide which of the
> >>>> >>"multi-gem5"
> >>>> >>>approaches is best, or if it's some combination of both?
> >>>> >>>
> >>>> >>>The only reason I never shipped it was because Steve raised an
> >>issue
> >>>> >>that
> >>>> >>>I didn't have a good alternative for, and didn't have the time to
> >>>>look
> >>>> >>into
> >>>> >>>one at that time.
> >>>> >>>________________________________________
> >>>> >>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
> >>>> >>Alian [
> >>>> >>>***@wisc.edu]
> >>>> >>>Sent: Wednesday, June 24, 2015 12:43 PM
> >>>> >>>To: gem5 Developer List
> >>>> >>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
> >>>> >>system
> >>>> >>>on multiple physical hosts
> >>>> >>>
> >>>> >>>Hi Andreas,
> >>>> >>>
> >>>> >>>Thanks for the comment.
> >>>> >>>I think the checkpointing support in both works is the same. Here
> >>is
> >>>> >>how
> >>>> >>>checkpointing support is implemented in pd-gem5:
> >>>> >>>
> >>>> >>>Whenever one of gem5 processes encounter an m5-checkpoint pseudo
> >>>> >>>instruction, it will send a "recv-ckpt" signal to the
> >>>> >>>"barrier" process. Then the "barrier" process sends a "take-ckpt"
> >>>> >>signal
> >>>> >>to
> >>>> >>>all the simulated nodes
> >>>> >>>(including the node that encountered m5-checkpoint) at the end of
> >>the
> >>>> >>>current simulation quantum. On the reception of
> >>>> >>>"take-ckpt" signal, gem5 processes start dumping check-points.
> >>This
> >>>> >>makes
> >>>> >>>each simulated node dump a checkpoint
> >>>> >>>at the same simulated time point while ensuring there is no
> >>in-flight
> >>>> >>>packets.
> >>>> >>>
> >>>> >>>I believe this is the same as multi-gem5 patch approach for
> >>>>checkpoint
> >>>> >>>support (based on the commit message of
> >>>> >>http://reviews.gem5.org/r/2865/
> >>>> >>).
> >>>> >>>Also, we have tested our mechanism with several benchmarks and it
> >>>> >>works.
> >>>> >>As
> >>>> >>>Steve suggested, I'll look into Curtis's patch and try to review
> >>it
> >>>>as
> >>>> >>>well.
> >>>> >>>But as Nilay also mentioned earlier, there are some codes missing
> >>in
> >>>> >>>Curtis's patch. I prefer to first run multi-gem5 before starting
> >>to
> >>>> >>review
> >>>> >>>it.
> >>>> >>>
> >>>> >>>Thank you,
> >>>> >>>Mohammad
> >>>> >>>
> >>>> >>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
> >>>> >>***@arm.com>
> >>>> >>>wrote:
> >>>> >>>
> >>>> >>>>Hi Steve,
> >>>> >>>>
> >>>> >>>>Apologies for the confusion. We are on the same page. My point is
> >>>> >>that
> >>>> >>we
> >>>> >>>>cannot simply take a little bit of patch A and a little bit of
> >>>> >>patch B.
> >>>> >>>>This change involves a lot of code, and we need to approach this
> >>in
> >>>> >>a
> >>>> >>>>structured fashion. My proposal is to do it bottom up, and start
> >>by
> >>>> >>>>getting the basic support in place. Since
> >>>> >>>http://reviews.gem5.org/r/2826/
> >>>> >>>>has already been on the review board for a few months, I am
> >>merely
> >>>> >>>>suggesting that the it would be a good start to relate the newly
> >>>> >>posted
> >>>> >>>>patches to what is already there.
> >>>> >>>>
> >>>> >>>>Andreas
> >>>> >>>>
> >>>> >>>>
> >>>> >>>>
> >>>> >>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> >>>> >>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
> >>>> >>>>
> >>>> >>>>>Hi Andreas,
> >>>> >>>>>
> >>>> >>>>>I'm a little confused by your email---you say you're
> >>fundamentally
> >>>> >>>opposed
> >>>> >>>>>to looking at both patches and picking the best features, then
> >>you
> >>>> >>point
> >>>> >>>>>out that the patches Curtis posted have the feature of better
> >>>> >>>>>checkpointing
> >>>> >>>>>support so we should pick that :).
> >>>> >>>>>
> >>>> >>>>>Obviously we can't just pick patch A from Mohammad's set and
> >>patch
> >>>> >>B
> >>>> >>>from
> >>>> >>>>>Curtis's set and expect them to work together, but I think that
> >>>> >>having
> >>>> >>>>>both
> >>>> >>>>>sets of patches available and comparing and contrasting the two
> >>>> >>>>>implementations should enable us to get to a single
> >>implementation
> >>>> >>>that's
> >>>> >>>>>the best of both. Someone will have to make the effort of
> >>>> >>integrating
> >>>> >>>the
> >>>> >>>>>better ideas from one set into the other set to create a new
> >>>> >>unified
> >>>> >>set
> >>>> >>>>>of
> >>>> >>>>>patches; (or maybe we commit one set and then integrate the
> >>best of
> >>>> >>the
> >>>> >>>>>other set as patches on top of that), but the first step is to
> >>>> >>identify
> >>>> >>>>>what "the best of both" is. Having Mohammad look at Curtis's
> >>>> >>patches,
> >>>> >>>and
> >>>> >>>>>Curtis (or someone else from ARM) closely examine Mohammad's
> >>>> >>patches
> >>>> >>>would
> >>>> >>>>>be a great start. I intend to review them both, though
> >>>> >>unfortunately
> >>>> >>my
> >>>> >>>>>time has been scarce lately---I'm hoping to squeeze that in
> >>later
> >>>> >>this
> >>>> >>>>>week.
> >>>> >>>>>
> >>>> >>>>>Once we've had a few people look at both, we can discuss the
> >>pros
> >>>> >>and
> >>>> >>>cons
> >>>> >>>>>of each, then discuss the strategy for getting the best features
> >>>> >>in.
> >>>> >>So
> >>>> >>>>>far I've heard that Mohammad's patches have a better network
> >>model
> >>>> >>but
> >>>> >>>the
> >>>> >>>>>ARM patches have better checkpointing support; that seems like a
> >>>> >>good
> >>>> >>>>>start.
> >>>> >>>>>
> >>>> >>>>>Steve
> >>>> >>>>>
> >>>> >>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> >>>> >>>***@arm.com
> >>>> >>>>>
> >>>> >>>>>wrote:
> >>>> >>>>>
> >>>> >>>>>>Hi all,
> >>>> >>>>>>
> >>>> >>>>>>Great work. However, I fundamentally do not believe in the
> >>>> >>approach
> >>>> >>of
> >>>> >>>>>>'letting reviewers pick the best features'. There is no way we
> >>>> >>would
> >>>> >>>>>>ever
> >>>> >>>>>>get something working out of it. We need to get _one_ working
> >>>> >>solution
> >>>> >>>>>>here, and figure out how to best get there. I would propose to
> >>>> >>do it
> >>>> >>>>>>bottom up, starting with the basic multi-simulator instance
> >>>> >>support,
> >>>> >>>>>>checkpointing support, and then move on to the network between
> >>>> >>the
> >>>> >>>>>>simulator instances.
> >>>> >>>>>>
> >>>> >>>>>>Thus, I propose we go with the low-level plumbing and
> >>checkpoint
> >>>> >>>support
> >>>> >>>>>>from what Curtis has posted. I believe proper checkpointing
> >>>> >>support
> >>>> >>to
> >>>> >>>>>>be
> >>>> >>>>>>the most challenging, and from what I can tell this is far more
> >>>> >>>limited
> >>>> >>>>>>in
> >>>> >>>>>>what you just posted Mohammad. Could you perhaps review Curtis
> >>>> >>patches
> >>>> >>>>>>based on your insights, and we can try and get these patches in
> >>>> >>shape
> >>>> >>>>>>and
> >>>> >>>>>>committed asap.
> >>>> >>>>>>
> >>>> >>>>>>Once we have the baseline functionality in place, then we can
> >>>> >>start
> >>>> >>>>>>looking at the more elaborate network models.
> >>>> >>>>>>
> >>>> >>>>>>Does this sound reasonable?
> >>>> >>>>>>
> >>>> >>>>>>Thanks,
> >>>> >>>>>>
> >>>> >>>>>>Andreas
> >>>> >>>>>>
> >>>> >>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> >>>> >>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >>>> >>>>>>
> >>>> >>>>>>>Hello All,
> >>>> >>>>>>>
> >>>> >>>>>>>I have submitted a chain of patches which enables gem5 to
> >>>> >>simulate
> >>>> >>a
> >>>> >>>>>>>cluster on multiple physical hosts:
> >>>> >>>>>>>
> >>>> >>>>>>>http://reviews.gem5.org/r/2909/
> >>>> >>>>>>>http://reviews.gem5.org/r/2910/
> >>>> >>>>>>>http://reviews.gem5.org/r/2912/
> >>>> >>>>>>>http://reviews.gem5.org/r/2913/
> >>>> >>>>>>>http://reviews.gem5.org/r/2914/
> >>>> >><http://reviews.gem5.org/r/2914/>
> >>>> >>>>>>>
> >>>> >>>>>>>and a patch that contains run scripts for a simple experiment:
> >>>> >>>>>>>http://reviews.gem5.org/r/2915/
> >>>> >>>>>>>
> >>>> >>>>>>>We have run several benchmarks using this infrastructure,
> >>>> >>including
> >>>> >>>NAS
> >>>> >>>>>>>parallel benchmarks (MPI) and DCBench-hadoop
> >>>> >>>>>>>(http://prof.ict.ac.cn/DCBench/),
> >>>> >>>>>>>and would be happy to share scripts/diskimages.
> >>>> >>>>>>>
> >>>> >>>>>>>We call this *pd-gem5*. *pd-gem5 *functionality is more or
> >>less
> >>>> >>the
> >>>> >>>>>>same
> >>>> >>>>>>>as
> >>>> >>>>>>>Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
> >>>> >>*network
> >>>> >>>>>>model
> >>>> >>>>>>>is
> >>>> >>>>>>>more thorough; it also enables modeling different network
> >>>> >>topologies.
> >>>> >>>>>>>Having both set of changes together let reviewers to pick best
> >>>> >>>features
> >>>> >>>>>>>from both works.
> >>>> >>>>>>>
> >>>> >>>>>>>Thank you,
> >>>> >>>>>>>Mohammad Alian
> >>>> >>>>>>>_______________________________________________
> >>>> >>>>>>>gem5-dev mailing list
> >>>> >>>>>>>gem5-***@gem5.org
> >>>> >>>>>>>http://m5sim.org/mailman/listinfo/gem5-dev
> >>>> >>>>>>
> >>>> >>>>>>
> >>>> >>>>>>-- IMPORTANT NOTICE: The contents of this email and any
> >>>> >>attachments
> >>>> >>>are
> >>>> >>>>>>confidential and may also be privileged. If you are not the
> >>>> >>intended
> >>>> >>>>>>recipient, please notify the sender immediately and do not
> >>>> >>disclose
> >>>> >>>the
> >>>> >>>>>>contents to any other person, use it for any purpose, or store
> >>or
> >>>> >>copy
> >>>> >>>>>>the
> >>>> >>>>>>information in any medium. Thank you.
> >>>> >>>>>>
> >>>> >>>>>>ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1
> >>>> >>9NJ,
> >>>> >>>>>>Registered in England & Wales, Company No: 2557590
> >>>> >>>>>>ARM Holdings plc, Registered office 110 Fulbourn Road,
> >>Cambridge
> >>>> >>CB1
> >>>> >>>>>>9NJ,
> >>>> >>>>>>Registered in England & Wales, Company No: 2548782
Gabor Dozsa
2015-07-06 16:45:10 UTC
Permalink
Thank you Steve for the detailed elaboration on the issues.


Regarding the “unsynchronized checkpoints”, the terminology might be a bit
confusing. In fact, we always need to do a global synchronization among
the gem5 processes before taking a distributed checkpoint (in order to
avoid in-flight packets). The global synchronization here means that each
gem5 process has to suspend its simulation and wait until every in-flight
packet has arrived (and been stored) at its destination gem5 process. If
that global synchronization step happens at the same simulated tick in each
gem5 process, we call the checkpoint “synchronous”; otherwise it is an
“asynchronous” checkpoint.
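
As a rough illustration of that terminology (this is not multi-gem5 code;
the function name and the per-process tick dictionary are made up for the
example), the distinction can be sketched in Python:

```python
def classify_checkpoint(drain_ticks):
    """Classify a distributed checkpoint from per-process drain ticks.

    drain_ticks maps a (hypothetical) gem5 process name to the simulated
    tick at which that process suspended simulation for the global sync.
    """
    if not drain_ticks:
        raise ValueError("need at least one gem5 process")
    # All processes stopped at one common tick: synchronous checkpoint.
    if len(set(drain_ticks.values())) == 1:
        return "synchronous"
    # At least two processes stopped at different ticks.
    return "asynchronous"


# Every process waits for the quantum boundary before draining.
print(classify_checkpoint({"node0": 2_000_000, "node1": 2_000_000}))
# The initiator drains immediately; its peer drains at a different tick.
print(classify_checkpoint({"node0": 1_250_000, "node1": 1_400_000}))
```

In both cases the global synchronization still takes place; only the ticks
at which the individual processes stop differ.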

In the MPI application example I mentioned before, the checkpoint should be
triggered as soon as the “slowest” MPI process reaches the MPI_barrier().
The problem is that the “slowest” MPI process usually does not reach the
MPI_barrier() right at the end of the current quantum. If we let the
simulation continue until the quantum completes (to ensure that the
checkpoint is taken at the same simulated tick in each gem5), then the MPI
processes will already have completed the MPI_barrier() and started
executing the ROI code.
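
To put a number on that overshoot, here is a small sketch. It assumes
quantum boundaries fall on multiples of the quantum starting at tick 0,
which is a simplification for illustration and not necessarily how either
patch set schedules its sync events; the helper names are hypothetical.

```python
def quantum_aligned_tick(trigger_tick, quantum):
    """First quantum boundary at or after trigger_tick.

    Assumes boundaries sit at 0, quantum, 2 * quantum, ...
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    # Ceiling division on integers, then scale back to ticks.
    return -(-trigger_tick // quantum) * quantum


def checkpoint_overshoot(trigger_tick, quantum):
    """Ticks simulated between the checkpoint request and the boundary."""
    return quantum_aligned_tick(trigger_tick, quantum) - trigger_tick


# The slowest MPI process reaches the barrier at tick 1,250,000 while
# fast-forwarding with a quantum of 1,000,000 ticks.
print(checkpoint_overshoot(1_250_000, 1_000_000))  # 750000
# A much shorter quantum shrinks the overshoot.
print(checkpoint_overshoot(1_250_000, 100_000))    # 50000
```

With the long quantum, the applications get 750,000 further ticks before
the checkpoint is taken, which is the window in which they can leave the
barrier and enter the ROI.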

Regarding the integration of multi-threaded/multi-host simulation,
multi-gem5 currently does not support fine-grained simulation of
hierarchical switches (or any network topology other than a single
crossbar), nor does it support multiple synchronization domains.

However, I'm a bit confused about your statement that you don’t see value
in ever building a shared-memory transport for MultiIface. MultiIface in
my view is just an abstract interface for “multi-(ether)-link” objects,
which are link objects for connecting multiple (i.e. more than two)
systems. It aims to encapsulate the API necessary for any Link object
in any multi-system configuration - provided that we partition the
systems across network links during run time.

An orthogonal issue is whether we want to include a simple crossbar switch
model in a MultiIface implementation or to provide a ‘standalone’
fine-grained model for the switch (e.g. the pd-gem5 approach).

Thanks,
- Gabor



On 7/3/15, 7:33 PM, "Steve Reinhardt" <***@gmail.com> wrote:

>Thanks Mohammad & Gabor for the responses.
>
>I think there's still some misunderstanding on what I mean by the
>integration of multi-threaded and multi-host simulation based on Gabor's
>response above and Andreas's response in the other thread.
>
>The primary example scenario I'm proposing is as Mohammad described:
>within
>each host node, we're simulating an entire rack + top-of-rack switch in a
>single gem5 process, with separate event queues/threads being used to
>parallelize across nodes within the rack. The switch may or may not be on
>its own thread as well. The synchronization among the threads only needs
>to be at the granularity of the intra-rack network latency.
>
>Now we want to expand this by using pd-gem5 or multi-gem5 to parallelize
>multiple of these rack-level simulations across hosts, so we can simulate
>a
>whole row of a datacenter. Only the uplinks from the TOR switches would
>need to go over sockets between processes, and the switch being modeled by
>pd-gem5 or multi-gem5 would be the end-of-row switch. The synchronization
>delay among the multiple gem5 processes would be based on the inter-rack
>latency.
>
>So the basic question is: Is this feasible with pd-gem5 / multi-gem5, and
>if not, how much work would it take to make it so?
>
>However, my larger point is that I still don't see value in ever building a
>shared-memory transport for MultiIface. For this model, there is clearly no
>need for it. Things get more complicated if we want to do something like
>have N nodes connected to a single switch and split that over two hosts
>(with N/2 nodes simulated on each), but even in that case, I think it's a
>better idea to make the switch model deal with having half of its links
>internal and half external (since we already want the same model to work in
>both the all-internal and all-external cases). Not that I'm worried that
>someone is about to go off and build this shared-memory transport, but I
>think it's important to reach an understanding here, since it's fundamental
>to defining the strategic relationship between these capabilities going
>forward.
>
>Stepping back a little further, it would be nice to have a model that is
>as
>generic as the multi-threading model, where it's really just a matter of
>taking a simulation, partitioning the components among the threads, and
>setting the synchronization quantum, and it works. Of course, even with
>the
>multi-threaded model, if you don't choose your partitioning and your
>quantum wisely, you're not going to get much speedup or a deterministic
>simulation, but the fundamental implementation is oblivious to that. I'm
>not saying we really need to go all the way to this extreme---it's pretty
>reasonable to assume that no one in the near future will want to partition
>across hosts anywhere other than on a simulated network link---but I think
>we should keep this ideal in mind as a guiding principle as we choose how
>to go forward from here.
>
>This ties in to my point #4, which is that if we're really building a
>mechanism to partition a simulation across multiple hosts, then you should
>be able to run the same simulation in a single gem5 process and get the
>same results. I think this is the strength of pd-gem5; correspondingly the
>main weakness of multi-gem5 is that it architecturally feels more like
>tying together a set of mostly independent gem5 simulations than like
>partitioning a single gem5 simulation. (Of course, they both end up at
>roughly the same point in the middle.)
>
>On the flip side, multi-gem5 has some clear advantages in terms of the
>better separation of the communication layer (and I can imagine it being
>very useful to port to MPI and perhaps some RDMA API for InfiniBand
>clusters). Also I think the integrated sockets for communication and
>synchronization are the superior design; while the separate sockets used by
>pd-gem5 may only very rarely cause problems, I agree with Andreas that
>that's not good enough, and I don't see any real advantage either---if you
>have to flush the data sockets (or wait for them to drain) before
>synchronizing, then you might as well just have the synchronization
>messages queue up behind the data messages.
>
>Regarding unsynchronized checkpoints: Thanks for the example, but I'm still
>a little confused. If all the processes are about to execute an
>MPI_Barrier(), doesn't that mean they'll all be synchronized shortly
>anyway? So what's the harm in waiting until they're synchronized and
>then checkpointing?
>
>Regarding the simulation of non-Ethernet networks: I agree that the
>biggest
>obstacle to this is the lack of generality of the current gem5 network
>components. I tried to take a step toward supporting other link types two
>years ago (see http://reviews.gem5.org/r/1922) but someone shot me down
>;).
>We shouldn't try and fix that here, but we should also consciously try not
>to make it any worse...
>
>Thanks for reading all the way to the end!
>
>Steve
>
>
>On Fri, Jul 3, 2015 at 7:11 AM Gabor Dozsa <***@arm.com> wrote:
>
>>Hi all,
>>
>>Thank you Steve for the thorough review.
>>
>>First, let me elaborate a bit on Andreas's 3rd point about non-synchronous
>>checkpoints. Let's assume that we aim to simulate MPI applications (HPC
>>workloads). The ROI in an MPI application typically starts with a
>>global MPI_Barrier() call. We want to take the checkpoint when *every*
>>gem5 process has reached that MPI_Barrier() in the simulated code, but that
>>may not happen at the same tick in each gem5 process (due to load imbalance
>>among the simulated nodes). That's why multi-gem5 implements the
>>non-synchronous checkpoint support.
>>
>>My answers to your questions are as follows.
>>
>>1. The only change necessary to use multi-gem5 with a non-Ethernet
>>(simulated) network is to replace the Ethernet packet type with another
>>packet type in MultiIface.
>>In fact, the first implementation of MultiIface was a template
>>that took EthPacketData as a parameter because I planned to support
>>different network types. When I realized that currently only Ethernet is
>>supported by gem5, I dropped the template param to keep the implementation
>>simpler. I have also realized in the meantime that the right approach would
>>probably be to create a pure virtual 'base' class for network packets from
>>which Ethernet (and other types of) packets could be derived. Then
>>MultiIface could simply use that base class to provide support for
>>different network types. The interface provided by the base packet class
>>could be very simple. Besides the total size() of the packet, multi-gem5
>>only needs a method to 'extract' the source/destination address. Those
>>addresses are used in MultiIface as opaque byte arrays, so they are quite
>>network-type agnostic already.
>>
>>2. That's right, we have designed the MultiIface/TCPIface split with
>>different underlying messaging systems in mind.
>>
>>3. Multi-gem5 can work together with multi-threaded/multi-event-queue
>>gem5
>>configs. The current TCPIface/tcp_server components would still use
>>sockets to send around the packets. So it is possible to put together a
>>multi-gem5 simulation where each gem5 process has multiple event queues
>>(and an independent simulation thread per event queue) but all the
>>simulated Ethernet links would use sockets to forward every Ethernet
>>packet to the tcp_server.
>>
>>If someone wanted to run only a single gem5 process to simulate an entire
>>cluster (using one thread/event-queue per cluster node) then the current
>>multi-gem5 implementation using sockets/tcp_server is not optimal. In that
>>case, a better solution would be to provide a shared-memory-based
>>implementation of the MultiIface virtual communication methods
>>sendRaw()/recvRaw()/syncRaw() (i.e. a shared-memory equivalent of
>>TCPIface). In that implementation, the entire discrete tcp_server
>>component could be replaced with a shared data structure.
>>
>>4. You are right, the current implementation does not make it possible
>>to
>>construct an equivalent single-process simulation model for a multi-gem5
>>run. However, a possible solution is a shared memory based
>>implementation
>>of the MultiIface virtual communication methods just as I described in
>>the
>>previous paragraph. The same implementation could then work with both
>>multi-threaded/multi-event-queues and single-thread/single-event-queue
>>gem5 configs.
>>
>>Thanks,
>>- Gabor
>>
>>On 7/2/15, 7:20 PM, "Steve Reinhardt" <***@gmail.com> wrote:
>>
>>>Hi everyone,
>>>
>>>Sorry for taking so long to engage. This is a great development and I
>>>think
>>>both these patches are terrific contributions. Thanks to Mohammad,
>>Gabor,
>>>and everyone else involved.
>>>
>>>I agree with Andreas that we should start with some top-level goals &
>>>assumptions, agree on those, and then we can sort out the detailed
>>issues
>>>based on a consistent view.
>>>
>>>I definitely agree with Andreas's first two points. The third one
>>seems a
>>>little surprising; I'd like to hear more about the motivation before
>>>expressing an opinion. I can see where non-synchronous checkpointing
>>could
>>>be useful, but it's also clear from the associated patch that it's not
>>>trivial to implement either. How much would be lost by requiring a
>>>synchronization before a checkpoint?
>>>
>>>From my personal perspective, I would like to see whatever we do here
>>be a
>>>first step toward a more general distributed simulation platform. Both
>>of
>>>these patches seem pretty Ethernet-centric in different ways. This is
>>not
>>>terrible; part of the problem is that gem5's current internal
>>networking
>>>support is already overly Ethernet-centric IMO. But it would be nice to
>>>avoid baking that in even further. Rather than assume I have understood
>>>all
>>>the code completely, I'll phrase things in the form of questions, and
>>>people can comment on how those questions would be answered in the
>>context
>>>of the two different approaches.
>>>
>>>1. How much effort would be required to simulate a non-Ethernet
>>network?
>>>My
>>>impression is that pd-gem5 has a leg up here, since a gem5 switch model
>>>for
>>>a non-Ethernet network (which you'd have to write anyway if you were
>>>simulating a different network) could be used in place of the current
>>>Ethernet switch, where for multi-gem5 I think that the
>>>util/multi//tcp_server.cc code would have to be modified (i.e.,
>>there'd be
>>>additional work above and beyond what you'd need to get the network
>>>modeled
>>>in base gem5).
>>>
>>>2. How much effort is required to run on a non-Ethernet network (or
>>>equivalently using a non-sockets API)? The MultiIface/TCPIface split
>>in
>>>the multi-gem5 code looks like it addresses this nicely, but pd-gem5
>>seems
>>>pretty tied to an Ethernet host fabric.
>>>
>>>3. Do both of these patches work with the existing multithreaded
>>>multiple-event-queue simulation? I think multi-gem5 does (though it
>>would
>>>be nice to have a confirmation), but it's not clear about pd-gem5. I
>>don't
>>>see a benefit to having multiple gem5 processes on a single host vs. a
>>>single multithreaded gem5 process using the existing support. I think
>>this
>>>could be particularly valuable with a hierarchical network; e.g.,
>>maybe I
>>>would want to model a rack in multithreaded mode on a single multicore
>>>server, then use pd-gem5 or multi-gem5 to build up a simulation of
>>>multiple
>>>racks. Would this work out of the box with either of these patches,
>>and if
>>>not, what would need to be done?
>>>
>>>4. Is it possible to construct a single-process simulation model that's
>>>identical to the distributed simulation? It would be very valuable for
>>>verification to be able to take a single simulation run and do it both
>>>within a single process and also across multiple processes and verify
>>that
>>>identical results are achieved. This seems like a big drawback to the
>>>multi-gem5 tcp_server approach, IMO.
>>>
>>>I'm definitely not saying that all these issues need to be resolved
>>before
>>>anything gets committed, but if we can agree that these are valid
>>goals,
>>>then we can evaluate detailed issues based on whether they move us
>>toward
>>>or away from those goals.
>>>
>>>Thanks,
>>>
>>>Steve
>>>
>>>
>>>On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson
>><***@arm.com>
>>>wrote:
>>>
>>>>Hi all,
>>>>
>>>>I think we need to up-level this a bit. From our perspective (and I
>>>>suspect in general):
>>>>
>>>>1. Robustness is important. Having a design that _may_ break, however
>>>>unlikely, is simply not an option.
>>>>
>>>>2. Performance and scaling are important. We can compare actual numbers
>>>>here, and I am fairly sure the two solutions are on par. Let's quantify
>>>>that though.
>>>>
>>>>3. Checkpointing must not rely on synchronicity. It is vital for
>>several
>>>>workloads that we can checkpoint the various gem5 instances at
>>different
>>>>Ticks (due to the way the workloads are constructed).
>>>>
>>>>Andreas
>>>>
>>>>On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
>>>>
>>>>>Thanks Gabor for the reply.
>>>>>
>>>>>I feel this conversation is useful as we can find out pros/cons of
>>each
>>>>>design.
>>>>>Please find my response in-lined below.
>>>>>
>>>>>Thank you,
>>>>>Mohammad
>>>>>
>>>>>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com>
>>>>wrote:
>>>>>
>>>>>>Hi All,
>>>>>>
>>>>>>Sorry for the missing indentation in my previous e-mail! (This was
>>my
>>>>>>first e-mail to the dev-list so I could not simply use “reply").
>>>>Below
>>>>>>is
>>>>>>the same message, hopefully in more readable form.
>>>>>>
>>>>>>====================================
>>>>>>
>>>>>>Hi All,
>>>>>>
>>>>>>Thank you Mohammad for your elaboration on the issues!
>>>>>>
>>>>>>I have written most of the multi-gem5 patch so let me add some more
>>>>>>clarifications and answer to your concerns. My comments are inline
>>>>>>below.
>>>>>>
>>>>>>Thanks,
>>>>>>- Gabor
>>>>>>
>>>>>>On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>>>>>>
>>>>>>>Hi All,
>>>>>>>
>>>>>>>Curtis, thank you for listing some of the differences. I was waiting for
>>>>>>>the completed multi-gem5 patch before sending my review. Please see my
>>>>>>>inline response below. I've addressed the concerns that you've raised.
>>>>>>>Also, I've added a bit more to the comparison.
>>>>>>>
>>>>>>>* Synchronization.
>>>>>>>
>>>>>>>pd-gem5 implements this in Python (not a problem in itself; aesthetically
>>>>>>>this is nice, but...). The issue is that pd-gem5's data packets and
>>>>>>>barrier messages travel over different sockets. Since pd-gem5 could see
>>>>>>>data packets passing synchronization barriers, it could create an
>>>>>>>inconsistent checkpoint.
>>>>>>>
>>>>>>>multi-gem5's synchronization is implemented in C++ using sync events, but
>>>>>>>more importantly, the messages queue up in the same stream and so cannot
>>>>>>>have the issue just described. (Event ordering is often crucial in
>>>>>>>snapshot protocols.) Therefore we feel that multi-gem5 is a more robust
>>>>>>>solution in this respect.
>>>>>>>
>>>>>>>Each packet in pd-gem5 has a time-stamp. So even if data packets pass
>>>>>>>synchronization barriers (in other words, data packets arrive early at
>>>>>>>the destination node), the destination node processes packets based on
>>>>>>>their timestamp. Actually, allowing data packets to pass sync barriers is
>>>>>>>a nice feature that can reduce the likelihood of late packet reception.
>>>>>>>The ordering of data messages that flow over pd-gem5 nodes is also
>>>>>>>preserved in the pd-gem5 implementation.
>>>>>>
>>>>>>This seems to be a misunderstanding. Maybe the wording was not precise
>>>>>>before. The problem is not a data packet "passing" a sync barrier
>>>>>>but the other way around: a sync barrier that can pass a data packet
>>>>>>(e.g. while the data packet is waiting in the host operating system
>>>>>>socket layer). If that happens, the packet will arrive later than it
>>>>>>was supposed to and it may miss the computed receive tick.
>>>>>>
>>>>>>For instance, let's assume that the quantum coincides with the simulated
>>>>>>Ether link delay. (This is the optimal choice of quantum to minimize the
>>>>>>number of sync barriers.) If a data packet is sent right at the beginning
>>>>>>of a quantum then this packet must arrive at the destination gem5 process
>>>>>>within the same quantum in order not to miss its receive tick at the very
>>>>>>beginning of the next quantum. If the sync barrier can pass the data
>>>>>>packet then the data packet may arrive only during the next quantum (or in
>>>>>>extreme conditions even later than that), so when it arrives the receiver
>>>>>>gem5 may already have passed the receive tick.
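
The arithmetic behind this example can be written down as a small sketch
in plain Python (not gem5 code). The function names and the tick values
are hypothetical and chosen only for illustration; the sketch assumes the
receive tick is simply the send tick plus the simulated link delay.

```python
def receive_tick(send_tick: int, link_delay: int) -> int:
    """Simulated tick at which the receiver must process the packet."""
    return send_tick + link_delay


def misses_receive_tick(send_tick: int, link_delay: int,
                        receiver_tick_on_arrival: int) -> bool:
    """True if the receiver has already simulated past the receive tick
    by the time the packet reaches it in real time."""
    return receiver_tick_on_arrival > receive_tick(send_tick, link_delay)


QUANTUM = 1000        # hypothetical quantum, in ticks
LINK_DELAY = QUANTUM  # quantum coincides with the simulated link delay

# Packet sent at the very start of quantum 0: it is due at tick 1000.
due = receive_tick(0, LINK_DELAY)

# Sync message queued behind the packet: the receiver cannot leave the
# barrier at tick 1000 before the packet has arrived, so it is on time.
on_time = not misses_receive_tick(0, LINK_DELAY, QUANTUM)

# Sync message overtakes the packet: the receiver is already at tick 1500
# in the next quantum when the packet shows up, so the packet is late.
late = misses_receive_tick(0, LINK_DELAY, 1500)
```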
>>>>>>
>>>>>This argument makes more sense than the previous one. Note that gem5 is a
>>>>>cycle accurate simulator and it runs orders of magnitude slower than real
>>>>>hardware. So it's almost impossible that the flight time of a packet
>>>>>through the real network turns out to be more than the simulation time of
>>>>>one quantum. We ran a set of experiments just for this purpose: with
>>>>>quantum size equal to etherlink delay, we never got any late arrival
>>>>>violation (what you described) for the full NAS benchmark suite (please
>>>>>refer to the paper).
>>>>>
>>>>>multi-gem5 is optimized for a case that almost never happens, and is
>>>>>sacrificing speedup for no gain.
>>>>>
>>>>>
>>>>>>Time-stamping does help with this issue. Also, if a data packet is
>>>>>>waiting
>>>>>>in the host operating system socket layer when the simulation
>>thread
>>>>>>exits
>>>>>>to python to complete the next sync barrier then the packet will
>>>>not go
>>>>>>into the checkpoint that may follow that sync barrier.
>>>>>>
>>>>>>That's a good point. Current pd-gem5 checkpointing mechanism might
>>>>miss
>>>>>packets that have been sent during previous quantum and are waiting
>>in
>>>>OS
>>>>>socket buffer. I should add some code inside ethertap serialization
>>>>>function to drain ethertap socket before writing checkpoint. I will
>>>>update
>>>>>pd-gem5 patch accordingly.
>>>>>
>>>>>>
>>>>>>>What you mentioned as an advantage for multi-gem5 is actually a
>>key
>>>>>>>disadvantage: buffering sync messages behind data packets can add
>>>>up to
>>>>>>>the
>>>>>>>synchronization overhead and slow down simulation significantly.
>>>>>>
>>>>>>The purpose of sync messages is to make sure that the data packets
>>>>>>arrive
>>>>>>in time (in terms of simulated time) at the destination so they can
>>>>be
>>>>>>scheduled for being received at the proper computed tick. Sync
>>>>messages
>>>>>>also make sure that no data packets are in flight when a sync
>>barrier
>>>>>>completes before we take a checkpoint. They definitely add
>>overhead
>>>>for
>>>>>>the simulation but they are necessary for the correctness of the
>>>>>>simulation.
>>>>>>
>>>>>>The receive thread in multi-gem5 reads out packets from the socket
>>in
>>>>>>parallel with the simulation thread so packets normally will not be
>>>>>>"queueing up” before a sync barrier message. There is definitely
>>>>room
>>>>>>for improvements in the current implementation for reducing the
>>>>>>synchronization overhead but that is likely true for pd-gem5, too.
>>>>>>The important thing here is that the solution must provide
>>>>correctness
>>>>>>(robustness) first.
>>>>>>
>>>>>>pd-gem5 provides correctness. Please read my previous comment. The
>>>>whole
>>>>>purpose of multi/pd-gem5 is to parallelize simulation with minimal
>>>>>overhead
>>>>>and gain speedup. If you fail to do so, nobody will use your tool.
>>>>>
>>>>>
>>>>>>>Also,
>>>>>>>multi-gem5 send huge sized messages (multiHeaderPkt) through
>>>>network to
>>>>>>>perform each synchronization point, which increases
>>synchronization
>>>>>>>overhead further. In pd-gem5, we choose to send just one character
>>>>as
>>>>>>sync
>>>>>>>message through a separate socket to reduce synchronization
>>>>overhead.
>>>>>>
>>>>>>The TCP/IP message size is unlikely the bottleneck here. Multi-gem5
>>>>will
>>>>>>send ~50 bytes more in a sync barrier message than pd-gem5 but that
>>>>>>bigger
>>>>>>sync message still fits into a single ethernet frame on the wire.
>>The
>>>>>>end-to-end latency overhead that is caused by 50 bytes extra
>>payload
>>>>for
>>>>>>a small single frame TCP/IP message is likely to fall into the
>>>>“noise"
>>>>>>category if one tries to measure it in a real cluster.
>>>>>>
>>>>>You should prove your hypothesis experimentally. Each gem5 process
>>>>>sends/receives sync messages at the end of every quantum. Say you are
>>>>>simulating an "N" node computer cluster with "M" different configurations.
>>>>>Then you will have N*M gem5 processes that send/receive these 50 Bytes (I
>>>>>think it's more) of extra data at the same time over the network ...
>>>>>
>>>>>Furthermore, multi-gem5 sends a header before each data message. In
>>>>>comparison, pd-gem5 just adds 12 Bytes (each time-stamp is the 12 least
>>>>>significant digits of the Tick) to each data packet. I don't know exactly
>>>>>how large these "MultiHeaderPkt"s are, but it has two Tick fields alone
>>>>>that are each 64 Bytes! Also, header packets are separate TCP packets, so
>>>>>you pay for sending two separate packets for each data packet. And worst
>>>>>of all, you serialize all of these with sync messages.
>>>>>
>>>>>
>>>>>>>
>>>>>>>* Packet handling.
>>>>>>>
>>>>>>>pd-gem5 uses EtherTap for data packets but changed the polling
>>>>>>mechanism
>>>>>>>
>>>>>>>to go through the main event queue. Since this rate is actually
>>>>linked
>>>>>>>
>>>>>>>with simulator progress, it cannot guarantee that the packets are
>>>>>>>serviced
>>>>>>>
>>>>>>>at regular intervals of real time. This can lead to packets
>>>>queueing
>>>>>>up
>>>>>>>
>>>>>>>which would contribute to the synchronization issues mentioned
>>>>above.
>>>>>>>
>>>>>>>multi-gem5 uses plain sockets with separate receive threads and so
>>>>does
>>>>>>>not
>>>>>>>
>>>>>>>have this issue.
>>>>>>>
>>>>>>>I think again you are pointing to your first concern that I've explained
>>>>>>>above. Packets that have queued up in the EtherTap socket will be
>>>>>>>processed and delivered to the simulation environment at the beginning of
>>>>>>>the next simulation quantum.
>>>>>>>
>>>>>>>Please notice that multi-gem5 introduces a new simObject to interface the
>>>>>>>simulation environment to the real world, which is redundant. This
>>>>>>>functionality is already provided by EtherTap.
>>>>>>
>>>>>>Except that the EtherTap solution does not provide a correct
>>(robust)
>>>>>>solution for the synchronization problem.
>>>>>>
>>>>>>Please read my first/second comments.
>>>>>
>>>>>
>>>>>>>
>>>>>>>* Checkpoint accuracy.
>>>>>>>
>>>>>>>A user would like to have a checkpoint at precisely the time the
>>>>>>>
>>>>>>>'m5 checkpoint' operation is executed so as to not miss any of the
>>>>>>>
>>>>>>>area of interest in his application.
>>>>>>>
>>>>>>>pd-gem5 requires that simulation finish the current quantum
>>>>>>>
>>>>>>>before checkpointing, so it cannot provide this.
>>>>>>>
>>>>>>>(Shortening the quantum can help, but usually the snapshot is
>>being
>>>>>>taken
>>>>>>>
>>>>>>>while 'fast-forwarding', i.e. simulating as fast as possible,
>>which
>>>>>>would
>>>>>>>
>>>>>>>motivate a longer quantum.)
>>>>>>>
>>>>>>>multi-gem5 can enter the drain cycle immediately upon receiving a
>>>>>>>
>>>>>>>checkpoint request. We find this accuracy highly desirable.
>>>>>>>
>>>>>>>It's true that if you have a large quantum size then there would be some
>>>>>>>discrepancy between the m5_ckpt instruction tick and the actual dump tick.
>>>>>>>Based on the multi-gem5 code, my understanding is that you send an async
>>>>>>>checkpoint message as soon as one of the gem5 processes encounters an
>>>>>>>m5_ckpt instruction. But I'm not sure how you fix the aforementioned
>>>>>>>issue, because you have to sync all gem5 processes before you start
>>>>>>>dumping the checkpoint, which necessitates a global synchronization
>>>>>>>beforehand.
>>>>>>
>>>>>>In multi-gem5, the gem5 process that encounters the m5_ckpt instruction
>>>>>>sends out an async checkpoint notification to the peer gem5 processes and
>>>>>>then starts draining immediately (at the same tick). So the
>>>>>>checkpoint will be taken at the exact tick from the initiator process's
>>>>>>point of view. The global synchronisation with the peer processes takes
>>>>>>place while the initiator process is still waiting at the same tick (i.e.
>>>>>>the simulation thread is suspended). However, the receiver thread
>>>>>>continues reading out the socket while waiting for the global sync to
>>>>>>complete, to make sure that in-flight data packets from peer gem5
>>>>>>processes are stored properly and saved into the checkpoint.
>>>>>>
>>>>>>
>>>>>So you mean multi-gem5 ends up with gem5 processes at different
>>>>>ticks after a checkpoint? In pd-gem5 we make sure that all gem5 processes
>>>>>start dumping the checkpoint at the same tick. Are you sure that it is
>>>>>correct to have each gem5 process dump its checkpoint at a different tick?
>>>>>
>>>>>I don't think this is a correct checkpointing design. However, if you feel
>>>>>it is correct, I can change a couple of lines in "Simulation.py" and the
>>>>>barrier scripts to implement the same functionality in pd-gem5. One thing
>>>>>that you are obsessed about is making sure that there are no in-flight
>>>>>packets when we start dumping the checkpoint, and you have all these
>>>>>complex mechanisms in place to ensure that! I think you can be 99.99999%
>>>>>sure that there is no in-flight packet by waiting for 1 second after all
>>>>>gem5 processes have finished their quantum simulation and then dumping the
>>>>>checkpoint. Do you really think that delivering a TCP packet would take
>>>>>more than 1 second in today's systems!?
>>>>>Always go for simple solutions ...
>>>>>
>>>>>
>>>>>
>>>>>>>
>>>>>>>By the way, we have a fix for this issue by introducing a new m5
>>>>pseudo
>>>>>>>instruction.
>>>>>>
>>>>>>I fail to see how a new pseudo instruction can solve the problem of
>>>>>>completing the full quantum in pd-gem5 before a checkpoint can be
>>>>taken.
>>>>>>Could you please elaborate on that?
>>>>>>
>>>>>>As we take checkpoint while fast-forwarding and it is likely that
>>we
>>>>>>relax
>>>>>synchronization for speedup purpose, a new pseudo instruction that
>>can
>>>>set
>>>>>quantum size (m5_qset) can be helpful. So, one can insert m5_qset in
>>>>his
>>>>>benchmark source code before entering ROI that contains m5_ckpt to
>>>>>decrease
>>>>>quantum size beforehand and reduce the discrepancy between m5_ckpt
>>tick
>>>>>and
>>>>>actual checkpoint tick. This is not included in pd-gem5 patch right
>>>>now.
>>>>>
>>>>>
>>>>>>>
>>>>>>>* Implementation of network topology.
>>>>>>>
>>>>>>>pd-gem5 uses a separate gem5 process to act as a switch whereas
>>>>>>multi-gem5
>>>>>>>
>>>>>>>uses a standalone packet relay process.
>>>>>>>
>>>>>>>We haven't measured the overhead of pd-gem5's simulated switch
>>yet,
>>>>but
>>>>>>>
>>>>>>>we're confident that our approach is at least as fast and more
>>>>>>scalable.
>>>>>>>
>>>>>>>There is the flexibility in pd-gem5 to simulate a switch box alongside
>>>>>>>one of the other gem5 processes. However, it might make that gem5 process
>>>>>>>the simulation bottleneck. One of the advantages of pd-gem5 over
>>>>>>>multi-gem5 is that we use gem5 to simulate a switch box, which allows us
>>>>>>>to model any network topology by instantiating several Switch simObjects
>>>>>>>and interconnecting them with EtherLink in an arbitrary fashion. A
>>>>>>>standalone tcp server can only provide switch functionality (forwarding
>>>>>>>packets to destinations) and model a star network topology. Furthermore,
>>>>>>>it cannot model various network timings such as queueing delay,
>>>>>>>congestion, and routing latency. Also, it has some accuracy issues that I
>>>>>>>will point out next.
>>>>>>
>>>>>>I agree with the complex topology argument. We already mentioned
>>that
>>>>>>before as an advantage for pd-gem5 from the point of view of future
>>>>>>extensions. However, I do not agree that multi-gem5 cannot model
>>>>>>queueing
>>>>>>delays and congestions. For a simple crossbar switch, it can model
>>>>>>queueing
>>>>>>delays and congestions, but the receive queues are distributed
>>among
>>>>the
>>>>>>gem5 processes.
>>>>>>
>>>>>It's true that you can model the queuing delay of a simple crossbar by
>>>>>distributing queues across gem5 processes (end points). But to be able to
>>>>>do so you have to ensure the ordering of packets that you enqueue in the
>>>>>distributed queues. It is almost impossible without a synchronized switch
>>>>>box. You should have a reorder queue that reorders packets dynamically and
>>>>>updates the timing parameters for each packet as well. I don't know how
>>>>>much progress you have made on an ordering scheme in multi-gem5, but you
>>>>>may have already realized how complex and error prone it can be. This
>>>>>argument is also related to my next argument for "Broken network timing".
>>>>>
>>>>>
>>>>>>>
>>>>>>>* Broken network timing:
>>>>>>>
>>>>>>>Forwarding packets between gem5 processes using a standalone tcp server
>>>>>>>can cause reordering between packets that have different sources but the
>>>>>>>same destination. It causes inaccurate network timing and, worst of all,
>>>>>>>non-deterministic simulation. pd-gem5 resolves this by reordering packets
>>>>>>>at the Switch process and then sending them to their destination (it's
>>>>>>>possible as the switch is synchronized with the rest of the nodes).
>>>>>>
>>>>>>In multi-gem5, there is always a HeaderPkt that contains some meta
>>>>>>information for each data packet. The meta information includes the send
>>>>>>tick and the sender rank (i.e. a unique ID of the sender gem5 process).
>>>>>>We use that information to define a well-defined ordering of packets even
>>>>>>if packets are arriving at the same receiver from different senders.
>>>>>>This packet ordering scheme is still being tested so the corresponding
>>>>>>patch is not on the RB yet.
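
Since that patch is not posted, the following is only a sketch in plain
Python of why a (send tick, sender rank) key yields the same delivery
order no matter how the host network interleaves arrivals. The Header
type and delivery_order() are hypothetical names, not the multi-gem5
implementation.

```python
from typing import List, NamedTuple


class Header(NamedTuple):
    """Hypothetical per-packet meta information."""
    send_tick: int    # simulated tick at which the packet was sent
    sender_rank: int  # unique ID of the sending gem5 process
    payload: bytes


def delivery_order(arrivals: List[Header]) -> List[Header]:
    """Order packets by simulated send time, breaking ties by sender rank.

    sorted() is stable, so two packets with the same tick from the same
    sender keep their arrival order, which a single TCP stream preserves.
    """
    return sorted(arrivals, key=lambda h: (h.send_tick, h.sender_rank))


a = Header(send_tick=100, sender_rank=2, payload=b"from-2")
b = Header(send_tick=100, sender_rank=1, payload=b"from-1")
c = Header(send_tick=90, sender_rank=3, payload=b"from-3")

# Two different real-time arrival orders give one simulated order.
run1 = delivery_order([a, b, c])
run2 = delivery_order([c, a, b])
```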
>>>>>>
>>>>>>Please read my previous comment. The most important part of
>>>>>>multi/pd-gem5
>>>>>extension is ensuring accurate and deterministic simulation.
>>>>>
>>>>>
>>>>>>>
>>>>>>>* Amount of changes
>>>>>>>
>>>>>>>pd-gem5 introduces different modes in etherlink just to provide accurate
>>>>>>>timing for each component in the network subsystem (NIC, link, switch) as
>>>>>>>well as the capability of modeling different network topologies (mesh,
>>>>>>>ring, fat tree, etc). To enable a simple functionality, like what
>>>>>>>multi-gem5 provides, the amount of changes in gem5 can be limited to
>>>>>>>time-stamping packets and providing synchronization through python
>>>>>>>scripts. However, multi-gem5 re-implements functionality that is already
>>>>>>>in gem5.
>>>>>>
>>>>>>This argument holds only if both implementations are correct
>>>>(robust).
>>>>>>It
>>>>>>still seems to me that pd-gem5 does not provide correctness for the
>>>>>>synchronization/checkpointing parts.
>>>>>>
>>>>>>Again, please read my first comment for correctness of pd-gem5.
>>>>>
>>>>>
>>>>>>>
>>>>>>>* Integrating with gem5 mainstream:
>>>>>>>
>>>>>>>pd-gem5 launch script is written in python which is suited for
>>>>>>integration
>>>>>>>with gem5 python scripts. However multi-gem5 uses bash script.
>>Also,
>>>>>>all
>>>>>>>source files in pd-gem5 are already parts of gem5 mainstream.
>>>>However
>>>>>>>multi-gem5 has tcp_server.cc/hh that is a standalone process and
>>>>cannot
>>>>>>be
>>>>>>>part of gem5.
>>>>>>
>>>>>>The multi-gem5 launch script is simple enough to rely only on the
>>>>>>shell. It can obviously be easily re-written in python if that added any
>>>>>>value. The tcp_server component is only a utility (like the "m5" utility
>>>>>>that is also part of gem5).
>>>>>>
>>>>>>The thing is that it's more likely that users want to add some
>>>>>functionality to the run-script of multi/pd-gem5. E.g. pd-gem5
>>>>run-script
>>>>>supports launching simulations using a simulation pool management
>>>>>software (
>>>>>http://research.cs.wisc.edu/htcondor/). Using python enables users to
>>>>>easily add these kind of supports.
>>>>>
>>>>>
>>>>>>
>>>>>>Cheers,
>>>>>>- Gabor
>>>>>>
>>>>>>
>>>>>>>On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham
>>>><***@arm.com>
>>>>>>>wrote:
>>>>>>>
>>>>>>>>Hello everyone,
>>>>>>>>We have taken a look at how pd-gem5 compares with multi-gem5.
>>>>While
>>>>>>>>intending
>>>>>>>>to deliver the same functionality, there are some crucial
>>>>differences:
>>>>>>>>
>>>>>>>>* Synchronization.
>>>>>>>>
>>>>>>>> pd-gem5 implements this in Python (not a problem in itself;
>>>>>>>> aesthetically this is nice, but...). The issue is that pd-gem5's
>>>>>>>> data packets and barrier messages travel over different sockets.
>>>>>>>> Since pd-gem5 could see data packets passing synchronization
>>>>>>>> barriers, it could create an inconsistent checkpoint.
>>>>>>>>
>>>>>>>> multi-gem5's synchronization is implemented in C++ using sync
>>>>>>>> events, but more importantly, the messages queue up in the same
>>>>>>>> stream and so cannot have the issue just described. (Event ordering
>>>>>>>> is often crucial in snapshot protocols.) Therefore we feel that
>>>>>>>> multi-gem5 is a more robust solution in this respect.
>>>>>>>>
>>>>>>>>* Packet handling.
>>>>>>>>
>>>>>>>> pd-gem5 uses EtherTap for data packets but changed the polling
>>>>>>>> mechanism to go through the main event queue. Since this rate is
>>>>>>>> actually linked with simulator progress, it cannot guarantee that
>>>>>>>> the packets are serviced at regular intervals of real time. This
>>>>>>>> can lead to packets queueing up, which would contribute to the
>>>>>>>> synchronization issues mentioned above.
>>>>>>>>
>>>>>>>> multi-gem5 uses plain sockets with separate receive threads and so
>>>>>>>> does not have this issue.
>>>>>>>>
>>>>>>>>* Checkpoint accuracy.
>>>>>>>>
>>>>>>>> A user would like to have a checkpoint at precisely the time the
>>>>>>>> 'm5 checkpoint' operation is executed so as to not miss any of the
>>>>>>>> area of interest in his application.
>>>>>>>>
>>>>>>>> pd-gem5 requires that simulation finish the current quantum
>>>>>>>> before checkpointing, so it cannot provide this.
>>>>>>>>
>>>>>>>> (Shortening the quantum can help, but usually the snapshot is being
>>>>>>>> taken while 'fast-forwarding', i.e. simulating as fast as possible,
>>>>>>>> which would motivate a longer quantum.)
>>>>>>>>
>>>>>>>> multi-gem5 can enter the drain cycle immediately upon receiving a
>>>>>>>> checkpoint request. We find this accuracy highly desirable.
>>>>>>>>
>>>>>>>>* Implementation of network topology.
>>>>>>>>
>>>>>>>> pd-gem5 uses a separate gem5 process to act as a switch whereas
>>>>>>>> multi-gem5 uses a standalone packet relay process.
>>>>>>>>
>>>>>>>> We haven't measured the overhead of pd-gem5's simulated switch yet,
>>>>>>>> but we're confident that our approach is at least as fast and more
>>>>>>>> scalable.
>>>>>>>>
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Curtis
>>>>>>>>________________________________________
>>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad Alian
>>>>>>>>[***@wisc.edu]
>>>>>>>>Sent: Friday, June 26, 2015 7:37 PM
>>>>>>>>To: gem5 Developer List
>>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>>>>>>>>system on multiple physical hosts
>>>>>>>>
>>>>>>>>Hi Anthony,
>>>>>>>>
>>>>>>>>I think that would be a good option; then I can add pd-gem5
>>>>>>>>functionality on top of that. Right now I've simplified your
>>>>>>>>implementation. Also, I think I had found some bugs in your patch
>>>>>>>>that I cannot remember now. If you decide to ship the EtherSwitch
>>>>>>>>patch, let me know so I can give you a review on that.
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Mohammad
>>>>>>>>
>>>>>>>>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <***@amd.com>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>>Would it make sense for me to ship the EtherSwitch patch first,
>>>>>>>>>since it has utility on its own, and then we can decide which of the
>>>>>>>>>"multi-gem5" approaches is best, or if it's some combination of both?
>>>>>>>>>
>>>>>>>>>The only reason I never shipped it was because Steve raised an issue
>>>>>>>>>that I didn't have a good alternative for, and didn't have the time
>>>>>>>>>to look into one at that time.
>>>>>>>>>________________________________________
>>>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad Alian
>>>>>>>>>[***@wisc.edu]
>>>>>>>>>Sent: Wednesday, June 24, 2015 12:43 PM
>>>>>>>>>To: gem5 Developer List
>>>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a parallel/distributed
>>>>>>>>>system on multiple physical hosts
>>>>>>>>>
>>>>>>>>>Hi Andreas,
>>>>>>>>>
>>>>>>>>>Thanks for the comment.
>>>>>>>>>I think the checkpointing support in both works is the same. Here is
>>>>>>>>>how checkpointing support is implemented in pd-gem5:
>>>>>>>>>
>>>>>>>>>Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
>>>>>>>>>instruction, it sends a "recv-ckpt" signal to the "barrier" process.
>>>>>>>>>Then the "barrier" process sends a "take-ckpt" signal to all the
>>>>>>>>>simulated nodes (including the node that encountered m5-checkpoint)
>>>>>>>>>at the end of the current simulation quantum. On reception of the
>>>>>>>>>"take-ckpt" signal, the gem5 processes start dumping checkpoints.
>>>>>>>>>This makes each simulated node dump a checkpoint at the same
>>>>>>>>>simulated time point while ensuring there are no in-flight packets.
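
The recv-ckpt/take-ckpt handshake described above can be sketched as a toy model. This is not code from the pd-gem5 patches: the class name `BarrierProcess`, its methods, and the message tuples are all hypothetical, and only the two signal names come from the description. It assumes quanta of a fixed length starting at tick 0.

```python
class BarrierProcess:
    """Toy model of the "barrier" process (hypothetical names throughout)."""

    def __init__(self, nodes, quantum):
        self.nodes = list(nodes)
        self.quantum = quantum
        self.ckpt_requested = False

    def recv_ckpt(self, node):
        # A node hit the m5-checkpoint pseudo instruction mid-quantum;
        # the request is only remembered, not acted on yet.
        self.ckpt_requested = True

    def end_of_quantum(self, tick):
        """Called once all nodes have reached the quantum boundary `tick`.

        Returns the (node, signal, tick) messages the barrier would send.
        """
        if tick % self.quantum != 0:
            raise ValueError("not a quantum boundary")
        if not self.ckpt_requested:
            return []
        self.ckpt_requested = False
        # Every node, including the requester, is told to checkpoint
        # at the same simulated tick.
        return [(n, "take-ckpt", tick) for n in self.nodes]
```

In this model the checkpoint tick is always a quantum boundary, never the tick at which the pseudo instruction executed.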
>>>>>>>>>
>>>>>>>>>I believe this is the same as the multi-gem5 patch's approach for
>>>>>>>>>checkpoint support (based on the commit message of
>>>>>>>>>http://reviews.gem5.org/r/2865/).
>>>>>>>>>Also, we have tested our mechanism with several benchmarks and it
>>>>>>>>>works. As Steve suggested, I'll look into Curtis's patch and try to
>>>>>>>>>review it as well.
>>>>>>>>>But as Nilay also mentioned earlier, there is some code missing in
>>>>>>>>>Curtis's patch. I prefer to first run multi-gem5 before starting to
>>>>>>>>>review it.
>>>>>>>>>
>>>>>>>>>Thank you,
>>>>>>>>>Mohammad
>>>>>>>>>
>>>>>>>>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <***@arm.com>
>>>>>>>>>wrote:
>>>>>>>>>
>>>>>>>>>>Hi Steve,
>>>>>>>>>>
>>>>>>>>>>Apologies for the confusion. We are on the same page. My point is
>>>>>>>>>>that we cannot simply take a little bit of patch A and a little bit
>>>>>>>>>>of patch B. This change involves a lot of code, and we need to
>>>>>>>>>>approach this in a structured fashion. My proposal is to do it
>>>>>>>>>>bottom up, and start by getting the basic support in place. Since
>>>>>>>>>>http://reviews.gem5.org/r/2826/ has already been on the review board
>>>>>>>>>>for a few months, I am merely suggesting that it would be a good
>>>>>>>>>>start to relate the newly posted patches to what is already there.
>>>>>>>>>>
>>>>>>>>>>Andreas
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
>>>>>>>>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>Hi Andreas,
>>>>>>>>>>>
>>>>>>>>>>>I'm a little confused by your email---you say you're fundamentally
>>>>>>>>>>>opposed to looking at both patches and picking the best features,
>>>>>>>>>>>then you point out that the patches Curtis posted have the feature
>>>>>>>>>>>of better checkpointing support so we should pick that :).
>>>>>>>>>>>
>>>>>>>>>>>Obviously we can't just pick patch A from Mohammad's set and patch B
>>>>>>>>>>>from Curtis's set and expect them to work together, but I think that
>>>>>>>>>>>having both sets of patches available and comparing and contrasting
>>>>>>>>>>>the two implementations should enable us to get to a single
>>>>>>>>>>>implementation that's the best of both. Someone will have to make
>>>>>>>>>>>the effort of integrating the better ideas from one set into the
>>>>>>>>>>>other set to create a new unified set of patches (or maybe we commit
>>>>>>>>>>>one set and then integrate the best of the other set as patches on
>>>>>>>>>>>top of that), but the first step is to identify what "the best of
>>>>>>>>>>>both" is. Having Mohammad look at Curtis's patches, and Curtis (or
>>>>>>>>>>>someone else from ARM) closely examine Mohammad's patches would be a
>>>>>>>>>>>great start. I intend to review them both, though unfortunately my
>>>>>>>>>>>time has been scarce lately---I'm hoping to squeeze that in later
>>>>>>>>>>>this week.
>>>>>>>>>>>
>>>>>>>>>>>Once we've had a few people look at both, we can discuss the pros
>>>>>>>>>>>and cons of each, then discuss the strategy for getting the best
>>>>>>>>>>>features in. So far I've heard that Mohammad's patches have a better
>>>>>>>>>>>network model but the ARM patches have better checkpointing support;
>>>>>>>>>>>that seems like a good start.
>>>>>>>>>>>
>>>>>>>>>>>Steve
>>>>>>>>>>>
>>>>>>>>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <***@arm.com>
>>>>>>>>>>>wrote:
>>>>>>>>>>>
>>>>>>>>>>>>Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>>Great work. However, I fundamentally do not believe in the approach
>>>>>>>>>>>>of 'letting reviewers pick the best features'. There is no way we
>>>>>>>>>>>>would ever get something working out of it. We need to get _one_
>>>>>>>>>>>>working solution here, and figure out how to best get there. I
>>>>>>>>>>>>would propose to do it bottom up, starting with the basic
>>>>>>>>>>>>multi-simulator instance support, checkpointing support, and then
>>>>>>>>>>>>move on to the network between the simulator instances.
>>>>>>>>>>>>
>>>>>>>>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
>>>>>>>>>>>>support from what Curtis has posted. I believe proper checkpointing
>>>>>>>>>>>>support to be the most challenging, and from what I can tell this
>>>>>>>>>>>>is far more limited in what you just posted, Mohammad. Could you
>>>>>>>>>>>>perhaps review Curtis's patches based on your insights, and we can
>>>>>>>>>>>>try and get these patches in shape and committed asap.
>>>>>>>>>>>>
>>>>>>>>>>>>Once we have the baseline functionality in place, then we can start
>>>>>>>>>>>>looking at the more elaborate network models.
>>>>>>>>>>>>
>>>>>>>>>>>>Does this sound reasonable?
>>>>>>>>>>>>
>>>>>>>>>>>>Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>>Andreas
>>>>>>>>>>>>

-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2557590
ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ, Registered in England & Wales, Company No: 2548782
Mohammad Alian
2015-07-07 05:29:24 UTC
Permalink
Gabor- My concern about unsynchronized checkpoints is that when you restore
from one, each gem5 process will be running at a different tick. How do you
then handle accurate delivery of packets between these gem5 processes? It
will also make it harder to integrate multi/pd-gem5 with the current
multi-threaded gem5. The problem with a synchronized checkpoint is that you
cannot take the checkpoint exactly at the ROI, but I think an unsynchronized
checkpoint introduces other problems. Considering the necessary warmup
period before starting stat collection, I think we don't need to pinpoint
the ROI exactly. Please correct me if I'm wrong.
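
To put a number on how far a synchronized checkpoint can land past the requested point, here is a minimal sketch. It is not code from either patch: it assumes quanta of a fixed length starting at tick 0, and `sync_checkpoint_tick` and `overshoot` are hypothetical helper names.

```python
def sync_checkpoint_tick(request_tick: int, quantum: int) -> int:
    """Tick of the first quantum boundary at or after request_tick.

    Assumption: quanta are [0, q), [q, 2q), ... with a fixed length q,
    and a synchronized checkpoint is only taken on a boundary.
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    # Ceiling division, then scale back up to a multiple of the quantum.
    return -(-request_tick // quantum) * quantum


def overshoot(request_tick: int, quantum: int) -> int:
    """Ticks simulated past the request before the checkpoint is taken."""
    return sync_checkpoint_tick(request_tick, quantum) - request_tick
```

Under these assumptions the overshoot is always smaller than one quantum, so the warmup argument amounts to saying that the warmup period is long compared with the quantum.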

I'm trying to run a multi-threaded experiment with pd-gem5, but I got an
error when I tried to partition a dual mode simulation across two threads. I
posted that on the gem5 users mailing list. Please help me with that if you
can.

Thank you,
Mohammad

On Mon, Jul 6, 2015 at 11:45 AM, Gabor Dozsa <***@arm.com> wrote:

> Thank you Steve for the detailed elaboration on the issues.
>
>
> Regarding the “unsynchronized checkpoints”, the terminology might be a bit
> confusing. In fact, we always need to do a global synchronization among
> the gem5 processes before taking a distributed checkpoint (in order to
> avoid in-flight packets). The global synchronization here means that each
> gem5 has to suspend the simulation and wait until every in-flight packet
> arrives (and is stored) at the destination gem5 process. If that global
> synchronization step happens at the same simulated tick in each gem5 then
> we call the checkpoint “synchronous”; otherwise it is an “asynchronous”
> checkpoint.
>
> In the MPI application example I mentioned before the checkpoint should be
> triggered as soon as the “slowest” MPI process reaches the MPI_barrier().
> The problem is that the “slowest” MPI process usually does not reach the
> MPI_barrier() right at the end of the current quantum. If we let the
> simulation continue until the quantum completes (to ensure that the
> checkpoint is taken at the same simulated tick in each gem5) then the MPI
> processes will complete the MPI_barrier and start executing the ROI code
> already.
>
> Regarding the integration of multi-threaded/multi-host simulation,
> multi-gem5 does not support fine grain simulation of hierarchical switches
> (or any other network topologies except a single crossbar) or multiple
> synchronization domains currently.
>
> However, I'm a bit confused about your statement that you don’t see value
> in ever building a shared-memory transport for MultiIface. MultiIface in
> my view is just an abstract interface for “multi-(ether)-link” objects
> which are link objects for connecting multiple (i.e. more than two)
> systems. It aims to encapsulate the API necessary for any Link object
> in any multi-system configuration - provided that we partition the
> systems across network links during run time.
>
> An orthogonal issue is whether we want to include a simple crossbar switch
> model in a MultiIface implementation or we want to provide a ‘standalone’
> fine grain model for the switch (e.g. the pd-gem5 approach).
>
> Thanks,
> - Gabor
>
>
>
> On 7/3/15, 7:33 PM, "Steve Reinhardt" <***@gmail.com> wrote:
>
> >Thanks Mohammad & Gabor for the responses.
> >
> >I think there's still some misunderstanding on what I mean by the
> >integration of multi-threaded and multi-host simulation based on Gabor's
> >response above and Andreas's response in the other thread.
> >
> >The primary example scenario I'm proposing is as Mohammad described:
> >within each host node, we're simulating an entire rack + top-of-rack
> >switch in a single gem5 process, with separate event queues/threads being
> >used to parallelize across nodes within the rack. The switch may or may
> >not be on its own thread as well. The synchronization among the threads
> >only needs to be at the granularity of the intra-rack network latency.
> >
> >Now we want to expand this by using pd-gem5 or multi-gem5 to parallelize
> >multiple of these rack-level simulations across hosts, so we can simulate
> >a whole row of a datacenter. Only the uplinks from the TOR switches would
> >need to go over sockets between processes, and the switch being modeled by
> >pd-gem5 or multi-gem5 would be the end-of-row switch. The synchronization
> >delay among the multiple gem5 processes would be based on the inter-rack
> >latency.
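
One way to read the two paragraphs above is as the usual conservative rule for quantum-based parallel simulation: partitions may run ahead of each other by at most the smallest latency of any link that crosses between them. The sketch below is only an illustration under that assumption. The link list, the partition names, and the latency values are made up, and neither patch exposes a function like `sync_quantum`.

```python
from typing import Iterable, Tuple

# (partition_a, partition_b, latency_in_ticks); all values are illustrative.
Link = Tuple[str, str, int]


def sync_quantum(links: Iterable[Link]) -> int:
    """Largest safe quantum: the minimum latency over links that cross
    a partition boundary. Links inside one partition are ignored."""
    crossing = [lat for a, b, lat in links if a != b]
    if not crossing:
        raise ValueError("no link crosses a partition boundary")
    return min(crossing)


# Threads inside one gem5 process: one partition per node, joined by the
# top-of-rack switch, so the intra-rack latency bounds the quantum.
intra_rack = [("node0", "tor", 1_000), ("node1", "tor", 1_000)]

# Separate gem5 processes: one partition per rack, joined only by uplinks
# to the end-of-row switch.
inter_rack = [("rack0", "eor", 10_000), ("rack1", "eor", 10_000),
              ("rack0", "rack0", 1_000)]  # internal link, ignored

print(sync_quantum(intra_rack))  # 1000
print(sync_quantum(inter_rack))  # 10000
```

With these made-up numbers the processes would synchronize ten times less often than the threads inside each process.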
> >
> >So the basic question is: Is this feasible with pd-gem5 / multi-gem5, and
> >if not, how much work would it take to make it so?
> >
> >However, my larger point is that I still don't see value in ever building
> >a shared-memory transport for MultiIface. For this model, there is clearly
> >no need for it. Things get more complicated if we want to do something
> >like have N nodes connected to a single switch and split that over two
> >hosts (with N/2 nodes simulated on each), but even in that case, I think
> >it's a better idea to make the switch model deal with having half of its
> >links internal and half external (since we already want the same model to
> >work in both the all-internal and all-external cases). Not that I'm
> >worried that someone is about to go off and build this shared-memory
> >transport, but I think it's important to reach an understanding here,
> >since it's fundamental to defining the strategic relationship between
> >these capabilities going forward.
> >
> >Stepping back a little further, it would be nice to have a model that is
> >as generic as the multi-threading model, where it's really just a matter
> >of taking a simulation, partitioning the components among the threads, and
> >setting the synchronization quantum, and it works. Of course, even with
> >the multi-threaded model, if you don't choose your partitioning and your
> >quantum wisely, you're not going to get much speedup or a deterministic
> >simulation, but the fundamental implementation is oblivious to that. I'm
> >not saying we really need to go all the way to this extreme---it's pretty
> >reasonable to assume that no one in the near future will want to partition
> >across hosts anywhere other than on a simulated network link---but I think
> >we should keep this ideal in mind as a guiding principle as we choose how
> >to go forward from here.
> >
> >This ties in to my point #4, which is that if we're really building a
> >mechanism to partition a simulation across multiple hosts, then you should
> >be able to run the same simulation in a single gem5 process and get the
> >same results. I think this is the strength of pd-gem5; correspondingly the
> >main weakness of multi-gem5 is that it architecturally feels more like
> >tying together a set of mostly independent gem5 simulations than like
> >partitioning a single gem5 simulation. (Of course, they both end up at
> >roughly the same point in the middle.)
> >
> >On the flip side, multi-gem5 has some clear advantages in terms of the
> >better separation of the communication layer (and I can imagine it being
> >very useful to port to MPI and perhaps some RDMA API for InfiniBand
> >clusters). Also I think the integrated sockets for communication and
> >synchronization are the superior design; while the separate sockets used
> >by pd-gem5 may only very rarely cause problems, I agree with Andreas that
> >that's not good enough, and I don't see any real advantage either---if you
> >have to flush the data sockets (or wait for them to drain) before
> >synchronizing, then you might as well just have the synchronization
> >messages queue up behind the data messages.
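
The ordering argument can be illustrated with a toy model, using plain Python queues as stand-ins for sockets; nothing here is taken from either patch, and the `SYNC` marker and function names are hypothetical. With one FIFO stream, a receiver that has read the sync marker has necessarily read every data message sent before it. With two independent channels, the marker can be observed while earlier data is still queued.

```python
from collections import deque

SYNC = "SYNC"  # hypothetical marker for a synchronization message


def drained_before_sync_single(sent):
    """One FIFO stream: read until the sync marker, return data seen so far."""
    stream = deque(sent)
    seen = []
    while stream:
        msg = stream.popleft()
        if msg == SYNC:
            break
        seen.append(msg)
    return seen


def drained_before_sync_split(sent, data_reads_before_sync):
    """Two channels: the sync channel is read after only
    `data_reads_before_sync` reads of the data channel, modelling a
    receiver that polls the two sockets independently."""
    data = deque(m for m in sent if m != SYNC)
    seen = []
    for _ in range(min(data_reads_before_sync, len(data))):
        seen.append(data.popleft())
    # The barrier is observed here; anything left in `data` is in flight.
    in_flight = list(data)
    return seen, in_flight


sent = ["pkt0", "pkt1", "pkt2", SYNC]
print(drained_before_sync_single(sent))    # ['pkt0', 'pkt1', 'pkt2']
print(drained_before_sync_split(sent, 1))  # (['pkt0'], ['pkt1', 'pkt2'])
```

The second call is the case that forces an explicit flush or drain step before a checkpoint; in the single-stream case the drain is implied by the ordering.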
> >
> >Regarding unsynchronized checkpoints: Thanks for the example, but I'm
> >still a little confused. If all the processes are about to execute an
> >MPI_Barrier(), doesn't that mean they'll all be synchronized shortly
> >anyway? So what's the harm in waiting until they're synchronized and
> >then checkpointing?
> >
> >Regarding the simulation of non-Ethernet networks: I agree that the
> >biggest obstacle to this is the lack of generality of the current gem5
> >network components. I tried to take a step toward supporting other link
> >types two years ago (see http://reviews.gem5.org/r/1922) but someone shot
> >me down ;). We shouldn't try and fix that here, but we should also
> >consciously try not to make it any worse...
> >
> >Thanks for reading all the way to the end!
> >
> >Steve
> >
> >
> >On Fri, Jul 3, 2015 at 7:11 AM Gabor Dozsa <***@arm.com> wrote:
> >
> >>Hi all,
> >>
> >>Thank you Steve for the thorough review.
> >>
> >>First, let me elaborate a bit on Andreas’s 3rd point about
> >>non-synchronous checkpoints. Let’s assume that we aim to simulate MPI
> >>applications (HPC workloads). The ROI in an MPI application typically
> >>starts with a global MPI_Barrier() call. We want to take the checkpoint
> >>when *every* gem5 process has reached that MPI_Barrier() in the
> >>simulated code, but that may not happen at the same tick in each gem5
> >>(due to load imbalance among the simulated nodes). That’s why multi-gem5
> >>implements the non-synchronous checkpoint support.
> >>
> >>My answers to your questions are as follows.
> >>
> >>1. The only change necessary to use multi-gem5 with a non-Ethernet
> >>(simulated) network is to replace the Ethernet packet type with another
> >>packet type in MultiIface.
> >>In fact, the first implementation of MultiIface was a template
> >>that took EthPacketData as a parameter because I planned to support
> >>different network types. When I realized that currently only Ethernet is
> >>supported by gem5, I dropped the template param to keep the
> >>implementation simpler. I have also realized in the meantime that the
> >>right approach would probably be to create a pure virtual ‘base’ class
> >>for network packets from which Ethernet (and other types of) packets
> >>could be derived. Then MultiIface could simply use that base class to
> >>provide support for different network types. The interface provided by
> >>the base packet class could be very simple. Besides the total size() of
> >>the packet, multi-gem5 only needs a method to ‘extract’ the
> >>source/destination address. Those addresses are used in MultiIface as
> >>opaque byte arrays so they are quite network type agnostic already.
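
As an illustration of the interface described in point 1, here is a sketch of such a base class. It is written in Python for brevity (the real classes would be C++ inside gem5), and every name in it (`NetPacket`, `EthPacket`, `route_key`) is hypothetical. The only assumption about Ethernet is the standard frame layout: destination MAC in bytes 0-5, source MAC in bytes 6-11.

```python
from abc import ABC, abstractmethod


class NetPacket(ABC):
    """Hypothetical network-agnostic packet interface."""

    @abstractmethod
    def size(self) -> int:
        """Total packet size in bytes."""

    @abstractmethod
    def src_addr(self) -> bytes:
        """Source address as an opaque byte string."""

    @abstractmethod
    def dst_addr(self) -> bytes:
        """Destination address as an opaque byte string."""


class EthPacket(NetPacket):
    """Ethernet frame: destination MAC first, then source MAC."""

    def __init__(self, frame: bytes):
        if len(frame) < 14:
            raise ValueError("frame shorter than an Ethernet header")
        self.frame = frame

    def size(self) -> int:
        return len(self.frame)

    def src_addr(self) -> bytes:
        return self.frame[6:12]

    def dst_addr(self) -> bytes:
        return self.frame[0:6]


def route_key(pkt: NetPacket) -> bytes:
    """What a relay would look at: only the opaque destination bytes."""
    return pkt.dst_addr()
```

A relay written against `NetPacket` never inspects the address format, so a packet type for another network would only need to supply its own three methods.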
> >>
> >>2. That’s right, we have designed the MultiIface/TCPIface split with
> >>different underlying messaging systems in mind.
> >>
> >>3. Multi-gem5 can work together with multi-threaded/multi-event-queue
> >>gem5 configs. The current TCPIface/tcp_server components would still use
> >>sockets to send around the packets. So it is possible to put together a
> >>multi-gem5 simulation where each gem5 process has multiple event queues
> >>(and an independent simulation thread per event queue) but all the
> >>simulated Ethernet links would use sockets to forward every Ethernet
> >>packet to the tcp_server.
> >>
> >>If someone wanted to run only a single gem5 process to simulate an
> >>entire cluster (using one thread/event-queue per cluster node) then the
> >>current multi-gem5 implementation using sockets/tcp_server is not
> >>optimal. In that case, a better solution would be to provide a shared
> >>memory based implementation of the MultiIface virtual communication
> >>methods sendRaw()/recvRaw()/syncRaw() (i.e. a shared memory equivalent of
> >>TCPIface). In that implementation, the entire discrete tcp_server
> >>component could be replaced with a shared data structure.
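
A rough sketch of that idea, again in Python rather than gem5's C++ and with invented signatures: only the method names sendRaw()/recvRaw()/syncRaw() come from the message above. Their arguments, the `Transport` base class, `SharedMemTransport`, and `make_cluster` are assumptions made for illustration. An in-process list of queues plus a `threading.Barrier` plays the role that tcp_server plays for the socket transport.

```python
import queue
import threading
from abc import ABC, abstractmethod


class Transport(ABC):
    """Hypothetical stand-in for the MultiIface communication methods."""

    @abstractmethod
    def sendRaw(self, dst: int, data: bytes) -> None: ...

    @abstractmethod
    def recvRaw(self) -> bytes: ...

    @abstractmethod
    def syncRaw(self) -> None: ...


class SharedMemTransport(Transport):
    """One instance per simulated node; all instances share `mailboxes`
    and `barrier`, which replace the standalone relay process."""

    def __init__(self, rank, mailboxes, barrier):
        self.rank = rank
        self.mailboxes = mailboxes
        self.barrier = barrier

    def sendRaw(self, dst, data):
        # Deliver directly into the destination node's queue.
        self.mailboxes[dst].put(data)

    def recvRaw(self):
        # Blocks until a message for this node is available.
        return self.mailboxes[self.rank].get()

    def syncRaw(self):
        # Blocks until every node has reached the same barrier.
        self.barrier.wait()


def make_cluster(n):
    """Build n transports that share one set of mailboxes and one barrier."""
    mailboxes = [queue.Queue() for _ in range(n)]
    barrier = threading.Barrier(n)
    return [SharedMemTransport(r, mailboxes, barrier) for r in range(n)]
```

Each simulation thread would hold one of the objects returned by `make_cluster`; code written against `Transport` would not need to know whether sockets or shared queues sit underneath.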
> >>
> >>4. You are right, the current implementation does not make it possible
> >>to construct an equivalent single-process simulation model for a
> >>multi-gem5 run. However, a possible solution is a shared memory based
> >>implementation of the MultiIface virtual communication methods just as I
> >>described in the previous paragraph. The same implementation could then
> >>work with both multi-threaded/multi-event-queues and
> >>single-thread/single-event-queue gem5 configs.
> >>
> >>Thanks,
> >>- Gabor
> >>
> >>On 7/2/15, 7:20 PM, "Steve Reinhardt" <***@gmail.com> wrote:
> >>
> >>>Hi everyone,
> >>>
> >>>Sorry for taking so long to engage. This is a great development and I
> >>>think both these patches are terrific contributions. Thanks to Mohammad,
> >>>Gabor, and everyone else involved.
> >>>
> >>>I agree with Andreas that we should start with some top-level goals &
> >>>assumptions, agree on those, and then we can sort out the detailed
> >>>issues based on a consistent view.
> >>>
> >>>I definitely agree with Andreas's first two points. The third one seems
> >>>a little surprising; I'd like to hear more about the motivation before
> >>>expressing an opinion. I can see where non-synchronous checkpointing
> >>>could be useful, but it's also clear from the associated patch that it's
> >>>not trivial to implement either. How much would be lost by requiring a
> >>>synchronization before a checkpoint?
> >>>
> >>>From my personal perspective, I would like to see whatever we do here
> >>>be a first step toward a more general distributed simulation platform.
> >>>Both of these patches seem pretty Ethernet-centric in different ways.
> >>>This is not terrible; part of the problem is that gem5's current
> >>>internal networking support is already overly Ethernet-centric IMO. But
> >>>it would be nice to avoid baking that in even further. Rather than
> >>>assume I have understood all the code completely, I'll phrase things in
> >>>the form of questions, and people can comment on how those questions
> >>>would be answered in the context of the two different approaches.
> >>>
> >>>1. How much effort would be required to simulate a non-Ethernet
> >>network?
> >>>My
> >>>impression is that pd-gem5 has a leg up here, since a gem5 switch model
> >>>for
> >>>a non-Ethernet network (which you'd have to write anyway if you were
> >>>simulating a different network) could be used in place of the current
> >>>Ethernet switch, where for multi-gem5 I think that the
> >>>util/multi/tcp_server.cc code would have to be modified (i.e.,
> >>there'd be
> >>>additional work above and beyond what you'd need to get the network
> >>>modeled
> >>>in base gem5).
> >>>
> >>>2. How much effort is required to run on a non-Ethernet network (or
> >>>equivalently using a non-sockets API)? The MultiIface/TCPIface split
> >>in
> >>>the multi-gem5 code looks like it addresses this nicely, but pd-gem5
> >>seems
> >>>pretty tied to an Ethernet host fabric.
> >>>
> >>>3. Do both of these patches work with the existing multithreaded
> >>>multiple-event-queue simulation? I think multi-gem5 does (though it
> >>would
> >>>be nice to have a confirmation), but it's not clear about pd-gem5. I
> >>don't
> >>>see a benefit to having multiple gem5 processes on a single host vs. a
> >>>single multithreaded gem5 process using the existing support. I think
> >>this
> >>>could be particularly valuable with a hierarchical network; e.g.,
> >>maybe I
> >>>would want to model a rack in multithreaded mode on a single multicore
> >>>server, then use pd-gem5 or multi-gem5 to build up a simulation of
> >>>multiple
> >>>racks. Would this work out of the box with either of these patches,
> >>and if
> >>>not, what would need to be done?
> >>>
> >>>4. Is it possible to construct a single-process simulation model that's
> >>>identical to the distributed simulation? It would be very valuable for
> >>>verification to be able to take a single simulation run and do it both
> >>>within a single process and also across multiple processes and verify
> >>that
> >>>identical results are achieved. This seems like a big drawback to the
> >>>multi-gem5 tcp_server approach, IMO.
> >>>
> >>>I'm definitely not saying that all these issues need to be resolved
> >>before
> >>>anything gets committed, but if we can agree that these are valid
> >>goals,
> >>>then we can evaluate detailed issues based on whether they move us
> >>toward
> >>>or away from those goals.
> >>>
> >>>Thanks,
> >>>
> >>>Steve
> >>>
> >>>
> >>>On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson
> >><***@arm.com>
> >>>wrote:
> >>>
> >>>>Hi all,
> >>>>
> >>>>I think we need to up-level this a bit. From our perspective (and I
> >>>>suspect in general):
> >>>>
> >>>>1. Robustness is important. Having a design that _may_ break, however
> >>>>unlikely, is simply not an option.
> >>>>
> >>>>2. Performance and scaling are important. We can compare actual numbers
> >>>>here, and I am fairly sure the two solutions are on par. Let’s
> >>quantify
> >>>>that though.
> >>>>
> >>>>3. Checkpointing must not rely on synchronicity. It is vital for
> >>several
> >>>>workloads that we can checkpoint the various gem5 instances at
> >>different
> >>>>Ticks (due to the way the workloads are constructed).
> >>>>
> >>>>Andreas
> >>>>
> >>>>On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
> >>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >>>>
> >>>>>Thanks Gabor for the reply.
> >>>>>
> >>>>>I feel this conversation is useful as we can find out pros/cons of
> >>each
> >>>>>design.
> >>>>>Please find my response in-lined below.
> >>>>>
> >>>>>Thank you,
> >>>>>Mohammad
> >>>>>
> >>>>>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com>
> >>>>wrote:
> >>>>>
> >>>>>>Hi All,
> >>>>>>
> >>>>>>Sorry for the missing indentation in my previous e-mail! (This was
> >>my
> >>>>>>first e-mail to the dev-list so I could not simply use "reply").
> >>>>Below
> >>>>>>is
> >>>>>>the same message, hopefully in more readable form.
> >>>>>>
> >>>>>>====================================
> >>>>>>
> >>>>>>Hi All,
> >>>>>>
> >>>>>>Thank you Mohammad for your elaboration on the issues!
> >>>>>>
> >>>>>>I have written most of the multi-gem5 patch so let me add some more
> >>>>>>clarifications and answer to your concerns. My comments are inline
> >>>>>>below.
> >>>>>>
> >>>>>>Thanks,
> >>>>>>- Gabor
> >>>>>>
> >>>>>>On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
> >>>>>>
> >>>>>>>Hi All,
> >>>>>>>
> >>>>>>>Curtis-Thank you for listing some of the differences. I was
> >>waiting
> >>>>for
> >>>>>>>the
> >>>>>>>completed multi-gem5 patch before I send my review. Please see my
> >>>>>>inline
> >>>>>>>response below. I've addressed the concerns that you've raised.
> >>>>Also,
> >>>>>>I've
> >>>>>>>added a bit more to the comparison.
> >>>>>>>
> >>>>>>>* Synchronization.
> >>>>>>>
> >>>>>>>pd-gem5 implements this in Python (not a problem in itself;
> >>>>>>aesthetically
> >>>>>>>
> >>>>>>>this is nice, but...). The issue is that pd-gem5's data packets
> >>and
> >>>>>>>
> >>>>>>>barrier messages travel over different sockets. Since pd-gem5
> >>could
> >>>>>>see
> >>>>>>>
> >>>>>>>data packets passing synchronization barriers, it could create an
> >>>>>>>
> >>>>>>>inconsistent checkpoint.
> >>>>>>>
> >>>>>>>multi-gem5's synchronization is implemented in C++ using sync
> >>>>events,
> >>>>>>but
> >>>>>>>
> >>>>>>>more importantly, the messages queue up in the same stream and so
> >>>>>>cannot
> >>>>>>>
> >>>>>>>have the issue just described. (Event ordering is often crucial
> >>in
> >>>>>>>
> >>>>>>>snapshot protocols.) Therefore we feel that multi-gem5 is a more
> >>>>robust
> >>>>>>>
> >>>>>>>solution in this respect.
> >>>>>>>
> >>>>>>>Each packet in pd-gem5 has a time-stamp. So even if data packets
> >>>>pass
> >>>>>>>synchronization barriers (in other words, data packets arrive
> >>early
> >>>>at
> >>>>>>the
> >>>>>>>destination node), the destination node processes packets based on their
> >>>>>>>timestamp. Actually allowing data packets to pass sync barriers
> >>is a
> >>>>>>nice
> >>>>>>>feature that can reduce the likelihood of late packet reception.
> >>>>>>Ordering
> >>>>>>>of data messages that flow over pd-gem5 nodes is also preserved in
> >>>>>>pd-gem5
> >>>>>>>implementation.
> >>>>>>
> >>>>>>This seems to be a misunderstanding. Maybe the wording was not
> >>>>precise
> >>>>>>before. The problem is not a data packet "passing" a sync
> >>barrier
> >>>>>>but the other way around, a sync barrier that can pass a data
> >>packet
> >>>>>>(e.g. while the data packet is waiting in the host operating system
> >>>>>>socket layer). If that happens, the packet will arrive later than
> >>it
> >>>>>>was
> >>>>>>supposed to and it may miss the computed receive tick.
> >>>>>>
> >>>>>>For instance, let’s assume that the quantum coincides with the
> >>>>simulated
> >>>>>>Ether link delay. (This is the optimal choice of quantum to
> >>minimize
> >>>>the
> >>>>>>number of sync barriers.) If a data packet is sent right at the
> >>>>>>beginning
> >>>>>>of a quantum then this packet must arrive at the destination gem5
> >>>>>>process
> >>>>>>within the same quantum in order not to miss its receive tick at
> >>the
> >>>>>>very
> >>>>>>beginning of the next quantum. If the sync barrier can pass the
> >>data
> >>>>>>packet
> >>>>>>then the data packet may arrive only during the next quantum (or
> >>in
> >>>>>>extreme conditions even later than that) so when it arrives the
> >>>>receiver
> >>>>>>gem5 may already have passed the receive tick.
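The timing argument above can be made concrete with a small worked example. This is a sketch under stated assumptions, not code from either patch: the function names are hypothetical, ticks are plain integers, and the quantum is set equal to the simulated link delay as in the scenario described.

```python
def receive_tick(send_tick, link_delay):
    """Simulated tick at which the packet must be delivered."""
    return send_tick + link_delay


def quantum_index(tick, quantum):
    """Which quantum a given simulated tick falls into."""
    return tick // quantum


def is_late(send_tick, link_delay, receiver_tick_on_arrival):
    """True if the receiver has already simulated past the receive tick
    by the time the packet actually shows up on the host."""
    return receiver_tick_on_arrival > receive_tick(send_tick, link_delay)


QUANTUM = 1000      # illustrative value
LINK_DELAY = 1000   # quantum coincides with the simulated link delay

# Packet sent at the very beginning of quantum 2.
send = 2000
deadline = receive_tick(send, LINK_DELAY)   # tick 3000, start of quantum 3

# Delivered on the host while the receiver is still inside quantum 2: fine.
on_time = is_late(send, LINK_DELAY, 2999)

# The sync barrier overtook the packet, the receiver moved on into
# quantum 3, and the packet only arrives once it is at tick 3400: too late.
too_late = is_late(send, LINK_DELAY, 3400)
```

With these numbers the packet has to reach the receiving process before the barrier that ends quantum 2 completes; any arrival after that point misses tick 3000.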
> >>>>>>
> >>>>>>This argument makes more sense than the previous one. Note that
> >>gem5
> >>>>is
> >>>>>>a
> >>>>>cycle accurate simulator and it runs orders of magnitude slower than
> >>>>real
> >>>>>hardware. So it's almost impossible that the flight time of a packet
> >>>>through
> >>>>>a real network turns out to be more than the simulation time of one quantum. We
> >>>>ran
> >>>>>a
> >>>>>set of experiments just for this purpose: with quantum size equal to
> >>>>>etherlink delay, we never got any late arrival violation (what you
> >>>>>described) for the full NAS benchmark suite (please refer to the paper).
> >>>>>
> >>>>>multi-gem5 is optimized for a case that almost never happens, and it is
> >>>>>sacrificing speedup for no gain.
> >>>>>
> >>>>>
> >>>>>>Time-stamping does help with this issue. Also, if a data packet is
> >>>>>>waiting
> >>>>>>in the host operating system socket layer when the simulation
> >>thread
> >>>>>>exits
> >>>>>>to python to complete the next sync barrier then the packet will
> >>>>not go
> >>>>>>into the checkpoint that may follow that sync barrier.
> >>>>>>
> >>>>>>That's a good point. Current pd-gem5 checkpointing mechanism might
> >>>>miss
> >>>>>packets that have been sent during previous quantum and are waiting
> >>in
> >>>>OS
> >>>>>socket buffer. I should add some code inside ethertap serialization
> >>>>>function to drain ethertap socket before writing checkpoint. I will
> >>>>update
> >>>>>pd-gem5 patch accordingly.
> >>>>>
> >>>>>>
> >>>>>>>What you mentioned as an advantage for multi-gem5 is actually a
> >>key
> >>>>>>>disadvantage: buffering sync messages behind data packets can add
> >>>>up to
> >>>>>>>the
> >>>>>>>synchronization overhead and slow down simulation significantly.
> >>>>>>
> >>>>>>The purpose of sync messages is to make sure that the data packets
> >>>>>>arrive
> >>>>>>in time (in terms of simulated time) at the destination so they can
> >>>>be
> >>>>>>scheduled for being received at the proper computed tick. Sync
> >>>>messages
> >>>>>>also make sure that no data packets are in flight when a sync
> >>barrier
> >>>>>>completes before we take a checkpoint. They definitely add
> >>overhead
> >>>>for
> >>>>>>the simulation but they are necessary for the correctness of the
> >>>>>>simulation.
> >>>>>>
> >>>>>>The receive thread in multi-gem5 reads out packets from the socket
> >>in
> >>>>>>parallel with the simulation thread so packets normally will not be
> >>>>>>"queueing up" before a sync barrier message. There is definitely
> >>>>room
> >>>>>>for improvements in the current implementation for reducing the
> >>>>>>synchronization overhead but that is likely true for pd-gem5, too.
> >>>>>>The important thing here is that the solution must provide
> >>>>correctness
> >>>>>>(robustness) first.
> >>>>>>
> >>>>>>pd-gem5 provides correctness. Please read my previous comment. The
> >>>>whole
> >>>>>purpose of multi/pd-gem5 is to parallelize simulation with minimal
> >>>>>overhead
> >>>>>and gain speedup. If you fail to do so, nobody will use your tool.
> >>>>>
> >>>>>
> >>>>>>>Also,
> >>>>>>>multi-gem5 sends huge messages (multiHeaderPkt) through
> >>>>network to
> >>>>>>>perform each synchronization point, which increases
> >>synchronization
> >>>>>>>overhead further. In pd-gem5, we choose to send just one character
> >>>>as
> >>>>>>sync
> >>>>>>>message through a separate socket to reduce synchronization
> >>>>overhead.
> >>>>>>
> >>>>>>The TCP/IP message size is unlikely the bottleneck here. Multi-gem5
> >>>>will
> >>>>>>send ~50 bytes more in a sync barrier message than pd-gem5 but that
> >>>>>>bigger
> >>>>>>sync message still fits into a single ethernet frame on the wire.
> >>The
> >>>>>>end-to-end latency overhead that is caused by 50 bytes extra
> >>payload
> >>>>for
> >>>>>>a small single frame TCP/IP message is likely to fall into the
> >>>>"noise"
> >>>>>>category if one tries to measure it in a real cluster.
> >>>>>>
> >>>>>>You should prove your hypothesis experimentally. Each gem5 process
> >>>>>sends/receives sync messages at the end of every quantum. Say you are
> >>>>>simulating an "N" node computer cluster with "M" different
> >>configurations.
> >>>>>Then
> >>>>>you will have N*M gem5 processes that send/receive these 50 Bytes (I
> >>>>>think
> >>>>>it's more) extra data at the same time over network ...
> >>>>>
> >>>>>Furthermore, multi-gem5 sends a header before each data message.
> >>>>In comparison,
> >>>>>pd-gem5 just adds 12 Bytes (each time-stamp is the 12 least
> >>>>>significant digits of the Tick) to each data packet. I don't know
> >>>>exactly
> >>>>>how large these "MultiHeaderPkt" messages are, but each has two Tick fields
> >>>>that
> >>>>>are 64 Bytes each! Also, header packets are separate TCP packets, so
> >>you
> >>>>>pay
> >>>>>for sending two separate packets for each data packet. And worse, you
> >>>>>serialize all of these with sync messages.
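The 12-byte time-stamp mentioned above can be illustrated with a short sketch. Several details here are assumptions rather than facts from the pd-gem5 patch: that the digits are ASCII decimal, that they are prepended (rather than appended) to the frame, and the helper names `add_timestamp` and `split_timestamp`.

```python
TS_DIGITS = 12  # time-stamp width quoted in the message above


def add_timestamp(frame: bytes, tick: int) -> bytes:
    """Prefix a frame with the 12 least significant decimal digits of tick.

    Placement at the front of the frame is an assumption of this sketch.
    """
    low_digits = tick % 10 ** TS_DIGITS
    stamp = str(low_digits).zfill(TS_DIGITS).encode("ascii")
    return stamp + frame


def split_timestamp(data: bytes):
    """Inverse of add_timestamp: return (truncated tick, original frame)."""
    return int(data[:TS_DIGITS]), data[TS_DIGITS:]


stamped = add_timestamp(b"abc", 1234)
tick, frame = split_timestamp(stamped)
```

One consequence of keeping only the low 12 digits is that the stamp wraps around every 10**12 ticks, so a receiver using this scheme would have to reconstruct the high-order part from its own current tick.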
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>* Packet handling.
> >>>>>>>
> >>>>>>>pd-gem5 uses EtherTap for data packets but changed the polling
> >>>>>>mechanism
> >>>>>>>
> >>>>>>>to go through the main event queue. Since this rate is actually
> >>>>linked
> >>>>>>>
> >>>>>>>with simulator progress, it cannot guarantee that the packets are
> >>>>>>>serviced
> >>>>>>>
> >>>>>>>at regular intervals of real time. This can lead to packets
> >>>>queueing
> >>>>>>up
> >>>>>>>
> >>>>>>>which would contribute to the synchronization issues mentioned
> >>>>above.
> >>>>>>>
> >>>>>>>multi-gem5 uses plain sockets with separate receive threads and so
> >>>>does
> >>>>>>>not
> >>>>>>>
> >>>>>>>have this issue.
> >>>>>>>
> >>>>>>>I think again you are pointing to your first concern that I've
> >>>>>>explained
> >>>>>>>above. Packets that have queued up in EtherTap socket, will be
> >>>>>>processed
> >>>>>>>and delivered to simulation environment at the beginning of next
> >>>>>>>simulation
> >>>>>>>quantum.
> >>>>>>>
> >>>>>>>Please notice that multi-gem5 introduces new simObjects to
> >>>>interface
> >>>>>>>simulation environment to real world which is redundant. This
> >>>>>>>functionality
> >>>>>>>is already there by EtherTap.
> >>>>>>
> >>>>>>Except that the EtherTap solution does not provide a correct
> >>(robust)
> >>>>>>solution for the synchronization problem.
> >>>>>>
> >>>>>>Please read my first/second comments.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>* Checkpoint accuracy.
> >>>>>>>
> >>>>>>>A user would like to have a checkpoint at precisely the time the
> >>>>>>>
> >>>>>>>'m5 checkpoint' operation is executed so as to not miss any of the
> >>>>>>>
> >>>>>>>area of interest in his application.
> >>>>>>>
> >>>>>>>pd-gem5 requires that simulation finish the current quantum
> >>>>>>>
> >>>>>>>before checkpointing, so it cannot provide this.
> >>>>>>>
> >>>>>>>(Shortening the quantum can help, but usually the snapshot is
> >>being
> >>>>>>taken
> >>>>>>>
> >>>>>>>while 'fast-forwarding', i.e. simulating as fast as possible,
> >>which
> >>>>>>would
> >>>>>>>
> >>>>>>>motivate a longer quantum.)
> >>>>>>>
> >>>>>>>multi-gem5 can enter the drain cycle immediately upon receiving a
> >>>>>>>
> >>>>>>>checkpoint request. We find this accuracy highly desirable.
> >>>>>>>
> >>>>>>>It's true that if you have a large quantum size then there would
> >>be
> >>>>>>some
> >>>>>>>discrepancy between the m5_ckpt instruction tick and the actual
> >>dump
> >>>>>>tick.
> >>>>>>>Based on multi-gem5 code, my understanding is that you send async
> >>>>>>>checkpoint message as soon as one of the gem5 processes encounter
> >>>>>>m5_ckpt
> >>>>>>>instruction. But I'm not sure how you fix the aforementioned
> >>issue,
> >>>>>>>because
> >>>>>>>you have to sync all gem5 processes before you start dumping
> >>>>>>checkpoint,
> >>>>>>>which necessitates a global synchronization beforehand.
> >>>>>>
> >>>>>>In multi-gem5, the gem5 process that encounters the m5_ckpt
> >>>>instruction
> >>>>>>sends out an async checkpoint notification for the peer gem5
> >>>>processes
> >>>>>>and
> >>>>>>then it starts the draining immediately (at the same tick). So the
> >>>>>>checkpoint will be taken at the exact tick from the initiator
> >>process's
> >>>>>>point of view. The global synchronisation with the peer processes
> >>>>takes
> >>>>>>place while the initiator process is still waiting at the same tick
> >>>>(i.e.
> >>>>>>the simulation thread is suspended). However, the receiver thread
> >>>>>>continues reading out the socket - while waiting for the global
> >>sync
> >>>>to
> >>>>>>complete- to make sure that in-flight data packets from peer gem5
> >>>>>>processes
> >>>>>>are stored properly and saved into the checkpoint.
> >>>>>>
> >>>>>>
> >>>>>So you mean multi-gem5 ends up with having gem5 processes with
> >>>>different
> >>>>>ticks after checkpoint? In pd-gem5 we make sure that all gem5
> >>processes
> >>>>>start dumping checkpoint at the same tick. Are you sure that this is
> >>>>>correct to have each gem5 process dump checkpoint at different
> >>ticks???
> >>>>>
> >>>>>I don't think this a correct checkpointing design. However, if you
> >>>>feel it
> >>>>>is correct, I can change a couple of lines in "Simulation.py" and
> >>>>barrier
> >>>>>scripts to implement the same functionality in pd-gem5. One thing
> >>that
> >>>>you
> >>>>>are obsessed about is to make sure that there are no in-flight packets
> >>>>>when
> >>>>>we start dumping the checkpoint, and you have all these complex
> >>mechanisms
> >>>>in
> >>>>>place to ensure that! I think you can be 99.99999% sure that there
> >>>>is no
> >>>>>in-flight packet by waiting for 1 second after all gem5 processes
> >>>>finished
> >>>>>their quantum simulation and then dump checkpoint. Do you really
> >>think
> >>>>>that
> >>>>>delivering a tcp packet would take more than 1 second in today's
> >>>>systems!?
> >>>>>Always go for simple solutions ...
> >>>>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>By the way, we have a fix for this issue by introducing a new m5
> >>>>pseudo
> >>>>>>>instruction.
> >>>>>>
> >>>>>>I fail to see how a new pseudo instruction can solve the problem of
> >>>>>>completing the full quantum in pd-gem5 before a checkpoint can be
> >>>>taken.
> >>>>>>Could you please elaborate on that?
> >>>>>>
> >>>>>>As we take checkpoint while fast-forwarding and it is likely that
> >>we
> >>>>>>relax
> >>>>>synchronization for speedup purpose, a new pseudo instruction that
> >>can
> >>>>set
> >>>>>quantum size (m5_qset) can be helpful. So, one can insert m5_qset in
> >>>>his
> >>>>>benchmark source code before entering ROI that contains m5_ckpt to
> >>>>>decrease
> >>>>>quantum size beforehand and reduce the discrepancy between m5_ckpt
> >>tick
> >>>>>and
> >>>>>actual checkpoint tick. This is not included in pd-gem5 patch right
> >>>>now.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>* Implementation of network topology.
> >>>>>>>
> >>>>>>>pd-gem5 uses a separate gem5 process to act as a switch whereas
> >>>>>>multi-gem5
> >>>>>>>
> >>>>>>>uses a standalone packet relay process.
> >>>>>>>
> >>>>>>>We haven't measured the overhead of pd-gem5's simulated switch
> >>yet,
> >>>>but
> >>>>>>>
> >>>>>>>we're confident that our approach is at least as fast and more
> >>>>>>scalable.
> >>>>>>>
> >>>>>>>There is this flexibility in pd-gem5 to simulate a switch box
> >>>>alongside
> >>>>>>>one
> >>>>>>>of the other gem5 processes. However, it might make that gem5
> >>>>process
> >>>>>>the
> >>>>>>>simulation bottleneck. One of the advantages of pd-gem5 over
> >>>>>>multi-gem5 is
> >>>>>>>that we use gem5 to simulate a switch box, which allows us to
> >>model
> >>>>any
> >>>>>>>network topology by instantiating several Switch simObjects and
> >>>>>>>interconnecting them with EtherLink in an arbitrary fashion. A
> >>>>standalone
> >>>>>>tcp
> >>>>>>>server just can provide switch functionality (forwarding packets
> >>to
> >>>>>>>destinations) and model a star network topology. Furthermore, it
> >>>>cannot
> >>>>>>>model various network timings such as queueing delay, congestion,
> >>>>and
> >>>>>>>routing latency. Also it has some accuracy issues that I will
> >>point
> >>>>out
> >>>>>>>next.
> >>>>>>
> >>>>>>I agree with the complex topology argument. We already mentioned
> >>that
> >>>>>>before as an advantage for pd-gem5 from the point of view of future
> >>>>>>extensions. However, I do not agree that multi-gem5 cannot model
> >>>>>>queueing
> >>>>>>delays and congestions. For a simple crossbar switch, it can model
> >>>>>>queueing
> >>>>>>delays and congestions, but the receive queues are distributed
> >>among
> >>>>the
> >>>>>>gem5 processes.
> >>>>>>
> >>>>>>It's true that you can model queuing delay of a simple crossbar by
> >>>>>distributing queues across gem5 processes (end points). But to be
> >>able
> >>>>to
> >>>>>do so you have to ensure the ordering of packets that you enqueue in
> >>>>the
> >>>>>distributed queues. It is almost impossible without a synchronized
> >>>>switch
> >>>>>box. You should have a reorder queue that reorders packets
> >>dynamically
> >>>>and
> >>>>>updates the timing parameters for each packet as well. I don't know how
> >>much
> >>>>>progress you have made on the ordering scheme in multi-gem5, but you
> >>>>may
> >>>>>have already realized how complex and error-prone it can be. This
> >>>>argument
> >>>>>is also related to my next argument for "Broken network timing".
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>* Broken network timing:
> >>>>>>>
> >>>>>>>Forwarding packets between gem5 processes using a standalone tcp
> >>>>server
> >>>>>>>can
> >>>>>>>cause reordering between packets that have different source but
> >>same
> >>>>>>>destination. It causes inaccurate network timing and, worst of all,
> >>>>>>>non-deterministic simulation. pd-gem5 resolves this by reordering
> >>>>>>packets
> >>>>>>>at
> >>>>>>>the Switch process and then sending them to their destination (it's
> >>>>possible
> >>>>>>as
> >>>>>>>switch is synchronized with the rest of the nodes).
> >>>>>>
> >>>>>>In multi-gem5, there is always a HeaderPkt that contains some meta
> >>>>>>information for each data packet. The meta information includes the
> >>>>send
> >>>>>>tick and the sender rank (i.e. a unique ID of the sender gem5
> >>>>process).
> >>>>>>We use that information to define a well-defined ordering of
> >>packets
> >>>>>>even
> >>>>>>if packets are arriving at the same receiver from different
> >>senders.
> >>>>>>This
> >>>>>>packet ordering scheme is still being tested so the corresponding
> >>>>patch
> >>>>>>is
> >>>>>>not on the RB yet.
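The ordering idea described above (send tick plus sender rank as a tie-breaker) can be sketched as a small priority queue. This is an illustration only: the actual multi-gem5 ordering patch is not shown in this thread, and the class name `OrderedReceiveQueue` and its methods are hypothetical.

```python
import heapq


class OrderedReceiveQueue:
    """Delivers packets in (send_tick, sender_rank) order, independent of
    the order in which they happened to arrive on the host network."""

    def __init__(self):
        self._heap = []

    def push(self, send_tick, sender_rank, payload):
        # Tuples compare element by element, so the sender rank breaks
        # ties between packets that carry the same send tick.
        heapq.heappush(self._heap, (send_tick, sender_rank, payload))

    def pop_until(self, tick):
        """Remove and return all packets with send_tick <= tick, in order."""
        out = []
        while self._heap and self._heap[0][0] <= tick:
            out.append(heapq.heappop(self._heap))
        return out


q = OrderedReceiveQueue()
# Host arrival order differs from simulated send order.
q.push(500, 2, b"c")
q.push(500, 1, b"b")
q.push(100, 3, b"a")
q.push(900, 0, b"d")
first_batch = q.pop_until(500)
second_batch = q.pop_until(1000)
```

Since the key depends only on simulated quantities, two runs that receive the same packets in different host-level orders would hand them to the simulated receiver in the same order.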
> >>>>>>
> >>>>>>Please read my previous comment. The most important part of
> >>>>>>multi/pd-gem5
> >>>>>extension is ensuring accurate and deterministic simulation.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>* Amount of changes
> >>>>>>>
> >>>>>>>pd-gem5 introduces different modes in etherlink just to provide
> >>>>accurate
> >>>>>>>timing for each component in the network subsystem (NIC, link,
> >>>>switch)
> >>>>>>as
> >>>>>>>well as capability of modeling different network topologies (mesh,
> >>>>>>ring,
> >>>>>>>fat tree, etc). To enable a simple functionality, like what
> >>>>multi-gem5
> >>>>>>>provides, the amount of changes in gem5 can be limited to
> >>>>time-stamping
> >>>>>>>packets and providing synchronization through python scripts.
> >>>>However,
> >>>>>>>multi-gem5 re-implements functionality that is already in gem5.
> >>>>>>
> >>>>>>This argument holds only if both implementations are correct
> >>>>(robust).
> >>>>>>It
> >>>>>>still seems to me that pd-gem5 does not provide correctness for the
> >>>>>>synchronization/checkpointing parts.
> >>>>>>
> >>>>>>Again, please read my first comment for correctness of pd-gem5.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>* Integrating with gem5 mainstream:
> >>>>>>>
> >>>>>>>pd-gem5 launch script is written in python which is suited for
> >>>>>>integration
> >>>>>>>with gem5 python scripts. However multi-gem5 uses bash script.
> >>Also,
> >>>>>>all
> >>>>>>>source files in pd-gem5 are already parts of gem5 mainstream.
> >>>>However
> >>>>>>>multi-gem5 has tcp_server.cc/hh that is a standalone process and
> >>>>cannot
> >>>>>>be
> >>>>>>>part of gem5.
> >>>>>>
> >>>>>>The multi-gem5 launch script is simple enough to rely only on the
> >>>>>>shell. It
> >>>>>>can obviously be easily re-written in python if that added any
> >>value.
> >>>>>>The
> >>>>>>tcp_server component is only a utility (like the "m5" utility that
> >>is
> >>>>>>also
> >>>>>>part of gem5).
> >>>>>>
> >>>>>>The thing is that it's more likely that users want to add some
> >>>>>functionality to the run-script of multi/pd-gem5. E.g. pd-gem5
> >>>>run-script
> >>>>>supports launching simulations using a simulation pool management
> >>>>>software (
> >>>>>http://research.cs.wisc.edu/htcondor/). Using python enables users to
> >>>>>easily add this kind of support.
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>Cheers,
> >>>>>>- Gabor
> >>>>>>
> >>>>>>
> >>>>>>>On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham
> >>>><***@arm.com>
> >>>>>>>wrote:
> >>>>>>>
> >>>>>>>>Hello everyone,
> >>>>>>>>We have taken a look at how pd-gem5 compares with multi-gem5.
> >>>>While
> >>>>>>>>intending
> >>>>>>>>to deliver the same functionality, there are some crucial
> >>>>differences:
> >>>>>>>>
> >>>>>>>>* Synchronization.
> >>>>>>>>
> >>>>>>>> pd-gem5 implements this in Python (not a problem in itself;
> >>>>>>>>aesthetically
> >>>>>>>> this is nice, but...). The issue is that pd-gem5's data
> >>>>packets
> >>>>>>and
> >>>>>>>> barrier messages travel over different sockets. Since
> >>pd-gem5
> >>>>>>could
> >>>>>>>>see
> >>>>>>>> data packets passing synchronization barriers, it could
> >>create
> >>>>an
> >>>>>>>> inconsistent checkpoint.
> >>>>>>>>
> >>>>>>>> multi-gem5's synchronization is implemented in C++ using sync
> >>>>>>events,
> >>>>>>>>but
> >>>>>>>> more importantly, the messages queue up in the same stream
> >>and
> >>>>so
> >>>>>>>>cannot
> >>>>>>>> have the issue just described. (Event ordering is often
> >>>>crucial
> >>>>>>in
> >>>>>>>> snapshot protocols.) Therefore we feel that multi-gem5 is a
> >>>>more
> >>>>>>>>robust
> >>>>>>>> solution in this respect.
> >>>>>>>>
> >>>>>>>>* Packet handling.
> >>>>>>>>
> >>>>>>>> pd-gem5 uses EtherTap for data packets but changed the
> >>polling
> >>>>>>>>mechanism
> >>>>>>>> to go through the main event queue. Since this rate is
> >>>>actually
> >>>>>>>>linked
> >>>>>>>> with simulator progress, it cannot guarantee that the packets
> >>>>are
> >>>>>>>>serviced
> >>>>>>>> at regular intervals of real time. This can lead to packets
> >>>>>>>>queueing up
> >>>>>>>> which would contribute to the synchronization issues
> >>mentioned
> >>>>>>above.
> >>>>>>>>
> >>>>>>>> multi-gem5 uses plain sockets with separate receive threads
> >>>>and so
> >>>>>>>>does
> >>>>>>>>not
> >>>>>>>> have this issue.
> >>>>>>>>
> >>>>>>>>* Checkpoint accuracy.
> >>>>>>>>
> >>>>>>>> A user would like to have a checkpoint at precisely the time
> >>the
> >>>>>>>> 'm5 checkpoint' operation is executed so as to not miss any of
> >>>>the
> >>>>>>>> area of interest in his application.
> >>>>>>>>
> >>>>>>>> pd-gem5 requires that simulation finish the current quantum
> >>>>>>>> before checkpointing, so it cannot provide this.
> >>>>>>>>
> >>>>>>>> (Shortening the quantum can help, but usually the snapshot is
> >>>>being
> >>>>>>>>taken
> >>>>>>>> while 'fast-forwarding', i.e. simulating as fast as possible,
> >>>>which
> >>>>>>>>would
> >>>>>>>> motivate a longer quantum.)
> >>>>>>>>
> >>>>>>>> multi-gem5 can enter the drain cycle immediately upon
> >>receiving
> >>>>a
> >>>>>>>> checkpoint request. We find this accuracy highly desirable.
> >>>>>>>>
> >>>>>>>>* Implementation of network topology.
> >>>>>>>>
> >>>>>>>> pd-gem5 uses a separate gem5 process to act as a switch
> >>whereas
> >>>>>>>>multi-gem5
> >>>>>>>> uses a standalone packet relay process.
> >>>>>>>>
> >>>>>>>> We haven't measured the overhead of pd-gem5's simulated switch
> >>>>yet,
> >>>>>>>>but
> >>>>>>>> we're confident that our approach is at least as fast and more
> >>>>>>>>scalable.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>Thanks,
> >>>>>>>>Curtis
> >>>>>>>>________________________________________
> >>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad
> >>>>>>Alian [
> >>>>>>>>***@wisc.edu]
> >>>>>>>>Sent: Friday, June 26, 2015 7:37 PM
> >>>>>>>>To: gem5 Developer List
> >>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a
> >>parallel/distributed
> >>>>>>>>system
> >>>>>>>>on multiple physical hosts
> >>>>>>>>
> >>>>>>>>Hi Anthony,
> >>>>>>>>
> >>>>>>>>I think that would be a good option, then I can add pd-gem5
> >>>>>>>>functionality
> >>>>>>>>on top of that. Right now I've simplified your implementation.
> >>>>Also, I
> >>>>>>>>think I had found some bugs in your patch that I cannot remember
> >>>>now.
> >>>>>>If
> >>>>>>>>you decided to ship EtherSwitch patch, let me know to give you a
> >>>>>>review
> >>>>>>>>on
> >>>>>>>>that.
> >>>>>>>>
> >>>>>>>>Thanks,
> >>>>>>>>Mohammad
> >>>>>>>>
> >>>>>>>>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
> >>>>>>>>***@amd.com> wrote:
> >>>>>>>>
> >>>>>>>>>Would it make sense for me to ship the EtherSwitch patch first,
> >>>>since
> >>>>>>>>it
> >>>>>>>>>has utility on its own, and then we can decide which of the
> >>>>>>>>"multi-gem5"
> >>>>>>>>>approaches is best, or if it's some combination of both?
> >>>>>>>>>
> >>>>>>>>>The only reason I never shipped it was because Steve raised an
> >>>>issue
> >>>>>>>>that
> >>>>>>>>>I didn't have a good alternative for, and didn't have the time
> >>to
> >>>>>>look
> >>>>>>>>into
> >>>>>>>>>one at that time.
> >>>>>>>>>________________________________________
> >>>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of Mohammad
> >>>>>>>>Alian [
> >>>>>>>>>***@wisc.edu]
> >>>>>>>>>Sent: Wednesday, June 24, 2015 12:43 PM
> >>>>>>>>>To: gem5 Developer List
> >>>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a
> >>parallel/distributed
> >>>>>>>>system
> >>>>>>>>>on multiple physical hosts
> >>>>>>>>>
> >>>>>>>>>Hi Andreas,
> >>>>>>>>>
> >>>>>>>>>Thanks for the comment.
> >>>>>>>>>I think the checkpointing support in both works is the same.
> >>Here
> >>>>is
> >>>>>>>>how
> >>>>>>>>>checkpointing support is implemented in pd-gem5:
> >>>>>>>>>
> >>>>>>>>>Whenever one of the gem5 processes encounters an m5-checkpoint pseudo
> >>>>>>>>>instruction, it will send a "recv-ckpt" signal to the
> >>>>>>>>>"barrier" process. Then the "barrier" process sends a
> >>"take-ckpt"
> >>>>>>>>signal
> >>>>>>>>to
> >>>>>>>>>all the simulated nodes
> >>>>>>>>>(including the node that encountered m5-checkpoint) at the end
> >>of
> >>>>the
> >>>>>>>>>current simulation quantum. On the reception of
> >>>>>>>>>the "take-ckpt" signal, gem5 processes start dumping checkpoints.
> >>>>This
> >>>>>>>>makes
> >>>>>>>>>each simulated node dump a checkpoint
> >>>>>>>>>at the same simulated time point while ensuring there is no
> >>>>in-flight
> >>>>>>>>>packets.
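The checkpoint handshake described above can be sketched as a small state machine for the barrier process. This is a simplified illustration, not the pd-gem5 script: the class and method names are hypothetical, the "continue" reply is an invented placeholder for the normal end-of-quantum release, and real signals would travel over sockets rather than method calls.

```python
class BarrierProcess:
    """Collects end-of-quantum arrivals and decides what to tell the nodes."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.arrived = set()
        self.ckpt_requested = False

    def on_recv_ckpt(self, node):
        # A node hit the m5-checkpoint pseudo instruction mid-quantum.
        self.ckpt_requested = True

    def on_quantum_end(self, node):
        """Return the reply broadcast to all nodes once everyone has
        finished the current quantum, or None while still waiting."""
        self.arrived.add(node)
        if len(self.arrived) < self.num_nodes:
            return None
        self.arrived.clear()
        reply = "take-ckpt" if self.ckpt_requested else "continue"
        self.ckpt_requested = False
        return reply


barrier = BarrierProcess(3)
replies = [barrier.on_quantum_end(0)]
barrier.on_recv_ckpt(1)            # node 1 requests a checkpoint
replies.append(barrier.on_quantum_end(1))
replies.append(barrier.on_quantum_end(2))
# Next quantum, no checkpoint request pending.
next_replies = [barrier.on_quantum_end(n) for n in range(3)]
```

The request is only acted on when the last node reaches the end of the quantum, which is why all nodes dump their checkpoints at the same simulated tick and why that tick can lag the m5-checkpoint instruction by up to one quantum.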
> >>>>>>>>>
> >>>>>>>>>I believe this is the same as multi-gem5 patch approach for
> >>>>>>checkpoint
> >>>>>>>>>support (based on the commit message of
> >>>>>>>>http://reviews.gem5.org/r/2865/
> >>>>>>>>).
> >>>>>>>>>Also, we have tested our mechanism with several benchmarks and
> >>it
> >>>>>>>>works.
> >>>>>>>>As
> >>>>>>>>>Steve suggested, I'll look into Curtis's patch and try to review
> >>>>it
> >>>>>>as
> >>>>>>>>>well.
> >>>>>>>>>But as Nilay also mentioned earlier, there is some code
> >>missing
> >>>>in
> >>>>>>>>>Curtis's patch. I prefer to first run multi-gem5 before starting
> >>>>to
> >>>>>>>>review
> >>>>>>>>>it.
> >>>>>>>>>
> >>>>>>>>>Thank you,
> >>>>>>>>>Mohammad
> >>>>>>>>>
> >>>>>>>>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
> >>>>>>>>***@arm.com>
> >>>>>>>>>wrote:
> >>>>>>>>>
> >>>>>>>>>>Hi Steve,
> >>>>>>>>>>
> >>>>>>>>>>Apologies for the confusion. We are on the same page. My point
> >>is
> >>>>>>>>that
> >>>>>>>>we
> >>>>>>>>>>cannot simply take a little bit of patch A and a little bit of
> >>>>>>>>patch B.
> >>>>>>>>>>This change involves a lot of code, and we need to approach
> >>this
> >>>>in
> >>>>>>>>a
> >>>>>>>>>>structured fashion. My proposal is to do it bottom up, and
> >>start
> >>>>by
> >>>>>>>>>>getting the basic support in place. Since
> >>>>>>>>>http://reviews.gem5.org/r/2826/
> >>>>>>>>>>has already been on the review board for a few months, I am
> >>>>merely
> >>>>>>>>>>suggesting that it would be a good start to relate the
> >>newly
> >>>>>>>>posted
> >>>>>>>>>>patches to what is already there.
> >>>>>>>>>>
> >>>>>>>>>>Andreas
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> >>>>>>>>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com>
> >>wrote:
> >>>>>>>>>>
> >>>>>>>>>>>Hi Andreas,
> >>>>>>>>>>>
> >>>>>>>>>>>I'm a little confused by your email---you say you're
> >>>>fundamentally
> >>>>>>>>>opposed
> >>>>>>>>>>>to looking at both patches and picking the best features, then
> >>>>you
> >>>>>>>>point
> >>>>>>>>>>>out that the patches Curtis posted have the feature of better
> >>>>>>>>>>>checkpointing
> >>>>>>>>>>>support so we should pick that :).
> >>>>>>>>>>>
> >>>>>>>>>>>Obviously we can't just pick patch A from Mohammad's set and
> >>>>patch
> >>>>>>>>B
> >>>>>>>>>from
> >>>>>>>>>>>Curtis's set and expect them to work together, but I think
> >>that
> >>>>>>>>having
> >>>>>>>>>>>both
> >>>>>>>>>>>sets of patches available and comparing and contrasting the
> >>two
> >>>>>>>>>>>implementations should enable us to get to a single
> >>>>implementation
> >>>>>>>>>that's
> >>>>>>>>>>>the best of both. Someone will have to make the effort of
> >>>>>>>>integrating
> >>>>>>>>>the
> >>>>>>>>>>>better ideas from one set into the other set to create a new
> >>>>>>>>unified
> >>>>>>>>set
> >>>>>>>>>>>of
> >>>>>>>>>>>patches; (or maybe we commit one set and then integrate the
> >>>>best of
> >>>>>>>>the
> >>>>>>>>>>>other set as patches on top of that), but the first step is to
> >>>>>>>>identify
> >>>>>>>>>>>what "the best of both" is. Having Mohammad look at Curtis's
> >>>>>>>>patches,
> >>>>>>>>>and
> >>>>>>>>>>>Curtis (or someone else from ARM) closely examine Mohammad's
> >>>>>>>>patches
> >>>>>>>>>would
> >>>>>>>>>>>be a great start. I intend to review them both, though
> >>>>>>>>unfortunately
> >>>>>>>>my
> >>>>>>>>>>>time has been scarce lately---I'm hoping to squeeze that in
> >>>>later
> >>>>>>>>this
> >>>>>>>>>>>week.
> >>>>>>>>>>>
> >>>>>>>>>>>Once we've had a few people look at both, we can discuss the
> >>>>pros
> >>>>>>>>and
> >>>>>>>>>cons
> >>>>>>>>>>>of each, then discuss the strategy for getting the best
> >>features
> >>>>>>>>in.
> >>>>>>>>So
> >>>>>>>>>>>far I've heard that Mohammad's patches have a better network
> >>>>model
> >>>>>>>>but
> >>>>>>>>>the
> >>>>>>>>>>>ARM patches have better checkpointing support; that seems
> >>like a
> >>>>>>>>good
> >>>>>>>>>>>start.
> >>>>>>>>>>>
> >>>>>>>>>>>Steve
> >>>>>>>>>>>
> >>>>>>>>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <
> >>>>>>>>>***@arm.com
> >>>>>>>>>>>
> >>>>>>>>>>>wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>>Hi all,
> >>>>>>>>>>>>
> >>>>>>>>>>>>Great work. However, I fundamentally do not believe in the
> >>>>>>>>approach
> >>>>>>>>of
> >>>>>>>>>>>>'letting reviewers pick the best features'. There is no way
> >>we
> >>>>>>>>would
> >>>>>>>>>>>>ever
> >>>>>>>>>>>>get something working out of it. We need to get _one_ working
> >>>>>>>>solution
> >>>>>>>>>>>>here, and figure out how to best get there. I would propose
> >>to
> >>>>>>>>do it
> >>>>>>>>>>>>bottom up, starting with the basic multi-simulator instance
> >>>>>>>>support,
> >>>>>>>>>>>>checkpointing support, and then move on to the network
> >>between
> >>>>>>>>the
> >>>>>>>>>>>>simulator instances.
> >>>>>>>>>>>>
> >>>>>>>>>>>>Thus, I propose we go with the low-level plumbing and
> >>>>checkpoint
> >>>>>>>>>support
> >>>>>>>>>>>>from what Curtis has posted. I believe proper checkpointing
> >>>>>>>>support
> >>>>>>>>to
> >>>>>>>>>>>>be
> >>>>>>>>>>>>the most challenging, and from what I can tell this is far
> >>more
> >>>>>>>>>limited
> >>>>>>>>>>>>in
> >>>>>>>>>>>>what you just posted Mohammad. Could you perhaps review
> >>Curtis's
> >>>>>>>>patches
> >>>>>>>>>>>>based on your insights, and we can try and get these patches
> >>in
> >>>>>>>>shape
> >>>>>>>>>>>>and
> >>>>>>>>>>>>committed asap.
> >>>>>>>>>>>>
> >>>>>>>>>>>>Once we have the baseline functionality in place, then we can
> >>>>>>>>start
> >>>>>>>>>>>>looking at the more elaborate network models.
> >>>>>>>>>>>>
> >>>>>>>>>>>>Does this sound reasonable?
> >>>>>>>>>>>>
> >>>>>>>>>>>>Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>>Andreas
> >>>>>>>>>>>>
> >>>>>>>>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> >>>>>>>>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu>
> >>wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>>Hello All,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>I have submitted a chain of patches which enables gem5 to
> >>>>>>>>simulate
> >>>>>>>>a
> >>>>>>>>>>>>>cluster on multiple physical hosts:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>http://reviews.gem5.org/r/2909/
> >>>>>>>>>>>>>http://reviews.gem5.org/r/2910/
> >>>>>>>>>>>>>http://reviews.gem5.org/r/2912/
> >>>>>>>>>>>>>http://reviews.gem5.org/r/2913/
> >>>>>>>>>>>>>http://reviews.gem5.org/r/2914/
> >>>>>>>><http://reviews.gem5.org/r/2914/>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>and a patch that contains run scripts for a simple
> >>experiment:
> >>>>>>>>>>>>>http://reviews.gem5.org/r/2915/
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>We have run several benchmarks using this infrastructure,
> >>>>>>>>including
> >>>>>>>>>NAS
> >>>>>>>>>>>>>parallel benchmarks (MPI) and DCBench-hadoop
> >>>>>>>>>>>>>(http://prof.ict.ac.cn/DCBench/),
> >>>>>>>>>>>>>and would be happy to share scripts/diskimages.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>We call this *pd-gem5*. *pd-gem5 *functionality is more or
> >>>>less
> >>>>>>>>the
> >>>>>>>>>>>>same
> >>>>>>>>>>>>>as
> >>>>>>>>>>>>>Curtis's patch for *multi-gem5.* However, I feel *pd-gem5
> >>>>>>>>*network
> >>>>>>>>>>>>model
> >>>>>>>>>>>>>is
> >>>>>>>>>>>>>more thorough; it also enables modeling different network
> >>>>>>>>topologies.
> >>>>>>>>>>>>>Having both sets of changes together lets reviewers pick
> >>best
> >>>>>>>>>features
> >>>>>>>>>>>>>from both works.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>Thank you,
> >>>>>>>>>>>>>Mohammad Alian
> >>>>>>>>>>>>>_______________________________________________
> >>>>>>>>>>>>>gem5-dev mailing list
> >>>>>>>>>>>>>gem5-***@gem5.org
> >>>>>>>>>>>>>http://m5sim.org/mailman/listinfo/gem5-dev
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>-- IMPORTANT NOTICE: The contents of this email and any
> >>>>>>>>attachments
> >>>>>>>>>are
> >>>>>>>>>>>>confidential and may also be privileged. If you are not the
> >>>>>>>>intended
> >>>>>>>>>>>>recipient, please notify the sender immediately and do not
> >>>>>>>>disclose
> >>>>>>>>>the
> >>>>>>>>>>>>contents to any other person, use it for any purpose, or
> >>store
> >>>>or
> >>>>>>>>copy
> >>>>>>>>>>>>the
> >>>>>>>>>>>>information in any medium. Thank you.
> >>>>>>>>>>>>
> >>>>>>>>>>>>ARM Limited, Registered office 110 Fulbourn Road, Cambridge
> >>CB1
> >>>>>>>>9NJ,
> >>>>>>>>>>>>Registered in England & Wales, Company No: 2557590
> >>>>>>>>>>>>ARM Holdings plc, Registered office 110 Fulbourn Road,
> >>>>Cambridge
> >>>>>>>>CB1
> >>>>>>>>>>>>9NJ,
> >>>>>>>>>>>>Registered in England & Wales, Company No: 2548782
Gabor Dozsa
2015-07-07 12:38:24 UTC
Permalink
Hi Mohammad and all,

The gem5 processes may restore from a checkpoint at different ticks, but the
next periodic sync will happen at the same tick in all gem5 processes. The
receive tick of a packet cannot fall within the current quantum, so every
packet can be scheduled for receive properly even if a checkpoint/restore
happens during a quantum.
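
The argument above can be illustrated with a small sketch. It is not taken
from either patch set; it assumes (hypothetically) that global syncs fall on
multiples of the quantum and that the simulated link latency is at least one
quantum. The function names and the constants are made up for illustration.

```python
# Hypothetical model: sync ticks are multiples of QUANTUM, and the
# simulated link latency is assumed to be >= QUANTUM.

def next_sync_tick(restore_tick: int, quantum: int) -> int:
    """First global sync tick strictly after restore_tick."""
    return (restore_tick // quantum + 1) * quantum


def receive_tick(send_tick: int, link_latency: int) -> int:
    """Tick at which a packet sent at send_tick is delivered."""
    return send_tick + link_latency


QUANTUM = 1000
LINK_LATENCY = 1000  # assumption: never smaller than QUANTUM

# Three processes restored inside the same quantum, at different ticks.
restore_ticks = [2100, 2450, 2999]

# All of them agree on the next sync point.
sync_ticks = {next_sync_tick(t, QUANTUM) for t in restore_ticks}

# A packet sent right after restore is never due before that sync point,
# so the receiver can still schedule it.
for t in restore_ticks:
    assert receive_tick(t, LINK_LATENCY) >= next_sync_tick(t, QUANTUM)
```

Under these assumptions the bound holds for any restore tick, because adding
a latency of at least one quantum always reaches the next quantum boundary.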

Regarding your multi-threaded dual config, my understanding is that
EtherLink is not prepared to work with multithreading, as it lacks thread
safety. The multiple event queues/threads config only works if the systems
are independent.

One possible way to fix that is to provide a "multi-thread" based
implementation for MultiIface ;-)
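
As a rough illustration of what a thread-based transport could look like,
here is a minimal sketch. It is not the multi-gem5 code: the class
InProcessIface, its constructor, and the snake_case methods are hypothetical
stand-ins loosely modeled on the sendRaw()/recvRaw()/syncRaw() methods
mentioned elsewhere in this thread, with thread-safe queues in place of
sockets and a barrier in place of the sync message.

```python
import queue
import threading


class InProcessIface:
    """Hypothetical thread-safe transport between simulated nodes."""

    def __init__(self, rank, inboxes, barrier):
        self.rank = rank
        self._inboxes = inboxes   # one thread-safe queue per node
        self._barrier = barrier   # shared by all nodes

    def send_raw(self, dest, payload):
        # queue.Queue is thread-safe, so no extra locking is needed here.
        self._inboxes[dest].put((self.rank, payload))

    def recv_raw(self, timeout=None):
        return self._inboxes[self.rank].get(timeout=timeout)

    def sync_raw(self):
        # Blocks until every node has reached the end of its quantum.
        self._barrier.wait()


def make_ifaces(n):
    inboxes = [queue.Queue() for _ in range(n)]
    barrier = threading.Barrier(n)
    return [InProcessIface(r, inboxes, barrier) for r in range(n)]


def run_node(iface, n, received):
    # Send one packet to the next node, then synchronize. After the
    # barrier, everything sent before it is already in the inbox.
    iface.send_raw((iface.rank + 1) % n, b"pkt-from-%d" % iface.rank)
    iface.sync_raw()
    received[iface.rank] = iface.recv_raw(timeout=1.0)


def demo(n=3):
    ifaces = make_ifaces(n)
    received = [None] * n
    threads = [threading.Thread(target=run_node, args=(i, n, received))
               for i in ifaces]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Calling demo(3) runs three node threads that each send one packet around a
ring. The point of the design is that the barrier doubles as the flush:
a node only reads its inbox after all sends of the quantum have completed.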

- Gabor

On 7/7/15, 6:29 AM, "Mohammad Alian" <***@wisc.edu> wrote:

>Gabor- My concern about unsync checkpoint is that when you restore from an
>unsync checkpoint, you'll have gem5 processes that are each running at a
>different tick. Then how do you handle accurate delivery of packets
>between
>these gem5 processes? It will also make it harder to integrate
>multi/pd-gem5 with current multi-threaded gem5. The problem with sync
>checkpoint is that you cannot take a checkpoint exactly at the ROI, but I think
>unsync checkpoint introduces some other problems. Considering the
>necessary
>warmup period before starting stat collection, I think we don't need to
>exactly pinpoint the ROI. Please correct me if I'm wrong.
>
>I'm trying to run a multi-threaded experiment with pd-gem5, but I got an
>error when I tried to partition dual mode simulation on two threads. I
>posted that on the gem5 users mailing list. Please help me with that if you can.
>
>Thank you,
>Mohammad
>
>On Mon, Jul 6, 2015 at 11:45 AM, Gabor Dozsa <***@arm.com> wrote:
>
>> Thank you Steve for the detailed elaboration on the issues.
>>
>>
>> Regarding the “unsynchronized checkpoints”, the terminology might be a
>>bit
>> confusing. In fact, we always need to do a global synchronization among
>> the gem5 processes before taking a distributed checkpoint (in order to
>> avoid in-flight packets). The global synchronization here means that
>>each
>> gem5 has to suspend the simulation and wait until every in-flight
>>packet
>> arrives (and is stored) at the destination gem5 process. If that global
>> synchronization step happens at the same simulated tick in each gem5
>>then
>> we call the checkpoint “synchronous”; otherwise it is an
>>“asynchronous”
>> checkpoint.
>>
>> In the MPI application example I mentioned before the checkpoint should
>>be
>> triggered as soon as the “slowest” MPI process reaches the
>>MPI_barrier().
>> The problem is that the “slowest” MPI process usually does not reach the
>> MPI_barrier() right at the end of the current quantum. If we let the
>> simulation continue until the quantum completes (to ensure that the
>> checkpoint is taken at the same simulated tick in each gem5) then the
>>MPI
>> processes will complete the MPI_barrier and start executing the ROI code
>> already.
>>
>> Regarding the integration of multi-threaded/multi-host simulation,
>> multi-gem5 does not support fine grain simulation of hierarchical
>>switches
>> (or any other network topologies except a single crossbar) or multiple
>> synchronization domains currently.
>>
>> However, I'm a bit confused about your statement that you don’t see
>>value
>> in ever building a shared-memory transport for MultiIface. MultiIface in
>> my view is just an abstract interface for “multi-(ether)-link" objects
>> which are link objects for connecting multiple (i.e. more than two)
>> systems. It aims to encapsulate the API necessary for any Link object
>> in any multi-system configuration - provided that we partition the
>> systems across network links during run time.
>>
>> An orthogonal issue is whether we want to include a simple crossbar switch
>> model in a MultiIface implementation or we want to provide a
>>'standalone'
>> fine grain model for the switch (e.g. the pd-gem5 approach).
>>
>> Thanks,
>> - Gabor
>>
>>
>>
>> On 7/3/15, 7:33 PM, "Steve Reinhardt" <***@gmail.com> wrote:
>>
>> >Thanks Mohammad & Gabor for the responses.
>> >
>> >I think there's still some misunderstanding on what I mean by the
>> >integration of multi-threaded and multi-host simulation based on
>>Gabor's
>> >response above and Andreas's response in the other thread.
>> >
>> >The primary example scenario I'm proposing is as Mohammad described:
>> >within
>> >each host node, we're simulating an entire rack + top-of-rack switch
>>in a
>> >single gem5 process, with separate event queues/threads being used to
>> >parallelize across nodes within the rack. The switch may or may not be
>>on
>> >its own thread as well. The synchronization among the threads only
>>needs
>> >to be at the granularity of the intra-rack network latency.
>> >
>> >Now we want to expand this by using pd-gem5 or multi-gem5 to
>>parallelize
>> >multiple of these rack-level simulations across hosts, so we can
>>simulate
>> >a
>> >whole row of a datacenter. Only the uplinks from the TOR switches
>>would
>> >need to go over sockets between processes, and the switch being
>>modeled by
>> >pd-gem5 or multi-gem5 would be the end-of-row switch. The
>>synchronization
>> >delay among the multiple gem5 processes would be based on the
>>inter-rack
>> >latency.
>> >
>> >So the basic question is: Is this feasible with pd-gem5 / multi-gem5,
>>and
>> >if not, how much work would it take to make it so?
>> >
>> >However, my larger point is that I still don't see value in ever
>>building
>> >a
>> >shared-memory transport for MultiIface. For this model, there is
>>clearly
>> >no
>> >need for it. Things get more complicated if we want to do something
>>like
>> >have N nodes connected to a single switch and split that over two hosts
>> >(with N/2 nodes simulated on each), but even in that case, I think
>>it's a
>> >better idea to make the switch model deal with having half of its links
>> >internal and half external (since we already want the same model to
>>work
>> >in
>> >both the all-internal and all-external cases). Not that I'm worried
>>that
>> >someone is about to go off and build this shared-memory transport, but
>>I
>> >think it's important to reach an understanding here, since it's
>> >fundamental
>> >to defining the strategic relationship between these capabilities going
>> >forward.
>> >
>> >Stepping back a little further, it would be nice to have a model that
>>is
>> >as
>> >generic as the multi-threading model, where it's really just a matter
>>of
>> >taking a simulation, partitioning the components among the threads, and
>> >setting the synchronization quantum, and it works. Of course, even with
>> >the
>> >multi-threaded model, if you don't choose your partitioning and your
>> >quantum wisely, you're not going to get much speedup or a deterministic
>> >simulation, but the fundamental implementation is oblivious to that.
>>I'm
>> >not saying we really need to go all the way to this extreme---it's
>>pretty
>> >reasonable to assume that no one in the near future will want to
>>partition
>> >across hosts anywhere other than on a simulated network link---but I
>>think
>> >we should keep this ideal in mind as a guiding principle as we choose
>>how
>> >to go forward from here.
>> >
>> >This ties in to my point #4, which is that if we're really building a
>> >mechanism to partition a simulation across multiple hosts, then you
>>should
>> >be able to run the same simulation in a single gem5 process and get the
>> >same results. I think this is the strength of pd-gem5; correspondingly
>>the
>> >main weakness of multi-gem5 is that it architecturally feels more like
>> >tying together a set of mostly independent gem5 simulations than like
>> >partitioning a single gem5 simulation. (Of course, they both end up at
>> >roughly the same point in the middle.)
>> >
>> >On the flip side, multi-gem5 has some clear advantages in terms of the
>> >better separation of the communication layer (and I can imagine it
>>being
>> >very useful to port to MPI and perhaps some RDMA API for InfiniBand
>> >clusters). Also I think the integrated sockets for communication and
>> >synchronization are the superior design; while the separate sockets
>>used
>> >by
>> >pd-gem5 may only very rarely cause problems, I agree with Andreas that
>> >that's not good enough, and I don't see any real advantage either---if
>>you
>> >have to flush the data sockets (or wait for them to drain) before
>> >synchronizing, then you might as well just have the synchronization
>> >messages queue up behind the data messages.
>> >
>> >Regarding unsynchronized checkpoints: Thanks for the example, but I'm
>> >still
>> >a little confused. If all the processes are about to execute an
>> >MPI_Barrier(), doesn't that mean they'll all be synchronized shortly
>> >anyway? So what's the harm in waiting until they're synchronized and
>> >then checkpointing?
>> >
>> >Regarding the simulation of non-Ethernet networks: I agree that the
>> >biggest
>> >obstacle to this is the lack of generality of the current gem5 network
>> >components. I tried to take a step toward supporting other link types
>>two
>> >years ago (see http://reviews.gem5.org/r/1922) but someone shot me down
>> >;).
>> >We shouldn't try and fix that here, but we should also consciously try
>>not
>> >to make it any worse...
>> >
>> >Thanks for reading all the way to the end!
>> >
>> >Steve
>> >
>> >
>> >On Fri, Jul 3, 2015 at 7:11 AM Gabor Dozsa <***@arm.com> wrote:
>> >
>> >>Hi all,
>> >>
>> >>Thank you Steve for the thorough review.
>> >>
>> >>First, let me elaborate a bit on Andreas’s 3rd point about
>> >>non-synchronous
>> >>checkpoints. Let’s assume that we aim to simulate MPI applications
>>(HPC
>> >>workloads). The ROI in an MPI application typically starts with a
>> >>global MPI_Barrier() call. We want to take the checkpoint when *every*
>> >>gem5 process has reached that MPI_Barrier() in the simulated code but
>> >>that
>> >>may not happen at the same tick in each gem5 (due to load imbalance
>> >>among
>> >>the simulated nodes). That’s why multi-gem5 implements the
>> >>non-synchronous
>> >>checkpoint support.
>> >>
>> >>My answers to your questions are as follows.
>> >>
>> >>1. The only change necessary to use multi-gem5 with a non Ethernet
>> >>(simulated) network is to replace the Ethernet packet type with
>>another
>> >>packet type in MultiIface.
>> >>In fact, the first implementation of MultiIface was a template
>> >>that took EthPacketData as a parameter because I planned to support
>>different
>> >>network types. When I realized that currently only Ethernet is
>>supported
>> >>by gem5 I dropped the template param to keep the implementation
>> >>simpler. I
>> >>have also realized in the meantime that the right approach would
>> >>probably
>> >>be to create a pure virtual 'base' class for network packets from
>>which
>> >>Ethernet (and other types of) packets could be derived. Then
>>MultiIface
>> >>could simply use that base class to provide support for different
>> >>network
>> >>types. The interface provided by the base packet class could be very
>> >>simple. Besides the total size() of the packet, multi-gem5 only needs a
>> >>method to 'extract' the source/destination address. Those addresses
>>are
>> >>used in MultiIface as opaque byte arrays so they are quite network
>>type
>> >>agnostic already.
>> >>
>> >>2. That’s right, we have designed the MultiIface/TCPIface split with
>> >>different underlying messaging systems in mind.
>> >>
>> >>3. Multi-gem5 can work together with multi-threaded/multi-event-queue
>> >>gem5
>> >>configs. The current TCPIface/tcp_server components would still use
>> >>sockets to send around the packets. So it is possible to put together
>>a
>> >>multi-gem5 simulation where each gem5 process has multiple event
>>queues
>> >>(and an independent simulation thread per event queue) but all the
>> >>simulated Ethernet links would use sockets to forward every Ethernet
>> >>packet to the tcp_server.
>> >>
>> >>If someone wanted to run only a single gem5 process to simulate an
>> >>entire
>> >>cluster (using one thread/event-queue per cluster node) then the
>>current
>> >>multi-gem5 implementation using sockets/tcp_server is not optimal. In
>> >>that
>> >>case, a better solution would be to provide a shared memory based
>> >>implementation of the MultiIface virtual communication methods
>> >>sendRaw()/recvRaw()/syncRaw() (i.e. a shared memory equivalent of
>> >>TCPIface). In that implementation, the entire discrete tcp_server
>> >>component
>> >>could be replaced with a shared data structure.
>> >>
>> >>4. You are right, the current implementation does not make it possible
>> >>to
>> >>construct an equivalent single-process simulation model for a
>>multi-gem5
>> >>run. However, a possible solution is a shared memory based
>> >>implementation
>> >>of the MultiIface virtual communication methods just as I described in
>> >>the
>> >>previous paragraph. The same implementation could then work with both
>> >>multi-threaded/multi-event-queues and single-thread/single-event-queue
>> >>gem5 configs.
>> >>
>> >>Thanks,
>> >>- Gabor
>> >>
>> >>On 7/2/15, 7:20 PM, "Steve Reinhardt" <***@gmail.com> wrote:
>> >>
>> >>>Hi everyone,
>> >>>
>> >>>Sorry for taking so long to engage. This is a great development and I
>> >>>think
>> >>>both these patches are terrific contributions. Thanks to Mohammad,
>> >>Gabor,
>> >>>and everyone else involved.
>> >>>
>> >>>I agree with Andreas that we should start with some top-level goals &
>> >>>assumptions, agree on those, and then we can sort out the detailed
>> >>issues
>> >>>based on a consistent view.
>> >>>
>> >>>I definitely agree with Andreas's first two points. The third one
>> >>seems a
>> >>>little surprising; I'd like to hear more about the motivation before
>> >>>expressing an opinion. I can see where non-synchronous checkpointing
>> >>could
>> >>>be useful, but it's also clear from the associated patch that it's
>>not
>> >>>trivial to implement either. How much would be lost by requiring a
>> >>>synchronization before a checkpoint?
>> >>>
>> >>>From my personal perspective, I would like to see whatever we do here
>> >>be a
>> >>>first step toward a more general distributed simulation platform.
>>Both
>> >>of
>> >>>these patches seem pretty Ethernet-centric in different ways. This is
>> >>not
>> >>>terrible; part of the problem is that gem5's current internal
>> >>networking
>> >>>support is already overly Ethernet-centric IMO. But it would be nice
>>to
>> >>>avoid baking that in even further. Rather than assume I have
>>understood
>> >>>all
>> >>>the code completely, I'll phrase things in the form of questions, and
>> >>>people can comment on how those questions would be answered in the
>> >>context
>> >>>of the two different approaches.
>> >>>
>> >>>1. How much effort would be required to simulate a non-Ethernet
>> >>network?
>> >>>My
>> >>>impression is that pd-gem5 has a leg up here, since a gem5 switch
>>model
>> >>>for
>> >>>a non-Ethernet network (which you'd have to write anyway if you were
>> >>>simulating a different network) could be used in place of the current
>> >>>Ethernet switch, where for multi-gem5 I think that the
>> >>>util/multi//tcp_server.cc code would have to be modified (i.e.,
>> >>there'd be
>> >>>additional work above and beyond what you'd need to get the network
>> >>>modeled
>> >>>in base gem5).
>> >>>
>> >>>2. How much effort is required to run on a non-Ethernet network (or
>> >>>equivalently using a non-sockets API)? The MultiIface/TCPIface split
>> >>in
>> >>>the multi-gem5 code looks like it addresses this nicely, but pd-gem5
>> >>seems
>> >>>pretty tied to an Ethernet host fabric.
>> >>>
>> >>>3. Do both of these patches work with the existing multithreaded
>> >>>multiple-event-queue simulation? I think multi-gem5 does (though it
>> >>would
>> >>>be nice to have a confirmation), but it's not clear about pd-gem5. I
>> >>don't
>> >>>see a benefit to having multiple gem5 processes on a single host vs.
>>a
>> >>>single multithreaded gem5 process using the existing support. I think
>> >>this
>> >>>could be particularly valuable with a hierarchical network; e.g.,
>> >>maybe I
>> >>>would want to model a rack in multithreaded mode on a single
>>multicore
>> >>>server, then use pd-gem5 or multi-gem5 to build up a simulation of
>> >>>multiple
>> >>>racks. Would this work out of the box with either of these patches,
>> >>and if
>> >>>not, what would need to be done?
>> >>>
>> >>>4. Is it possible to construct a single-process simulation model
>>that's
>> >>>identical to the distributed simulation? It would be very valuable
>>for
>> >>>verification to be able to take a single simulation run and do it
>>both
>> >>>within a single process and also across multiple processes and verify
>> >>that
>> >>>identical results are achieved. This seems like a big drawback to the
>> >>>multi-gem5 tcp_server approach, IMO.
>> >>>
>> >>>I'm definitely not saying that all these issues need to be resolved
>> >>before
>> >>>anything gets committed, but if we can agree that these are valid
>> >>goals,
>> >>>then we can evaluate detailed issues based on whether they move us
>> >>toward
>> >>>or away from those goals.
>> >>>
>> >>>Thanks,
>> >>>
>> >>>Steve
>> >>>
>> >>>
>> >>>On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson
>> >><***@arm.com>
>> >>>wrote:
>> >>>
>> >>>>Hi all,
>> >>>>
>> >>>>I think we need to up-level this a bit. From our perspective (and I
>> >>>>suspect in general):
>> >>>>
>> >>>>1. Robustness is important. Having a design that _may_ break,
>>however
>> >>>>unlikely, is simply not an option.
>> >>>>
>> >>>>2. Performance and scaling are important. We can compare actual
>>numbers
>> >>>>here, and I am fairly sure the two solutions are on par. Let’s
>> >>quantify
>> >>>>that though.
>> >>>>
>> >>>>3. Checkpointing must not rely on synchronicity. It is vital for
>> >>several
>> >>>>workloads that we can checkpoint the various gem5 instances at
>> >>different
>> >>>>Ticks (due to the way the workloads are constructed).
>> >>>>
>> >>>>Andreas
>> >>>>
>> >>>>On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
>> >>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
>> >>>>
>> >>>>>Thanks Gabor for the reply.
>> >>>>>
>> >>>>>I feel this conversation is useful as we can find out pros/cons of
>> >>each
>> >>>>>design.
>> >>>>>Please find my response in-lined below.
>> >>>>>
>> >>>>>Thank you,
>> >>>>>Mohammad
>> >>>>>
>> >>>>>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com>
>> >>>>wrote:
>> >>>>>
>> >>>>>>Hi All,
>> >>>>>>
>> >>>>>>Sorry for the missing indentation in my previous e-mail! (This was
>> >>my
>> >>>>>>first e-mail to the dev-list so I could not simply use "reply").
>> >>>>Below
>> >>>>>>is
>> >>>>>>the same message, hopefully in more readable form.
>> >>>>>>
>> >>>>>>====================================
>> >>>>>>
>> >>>>>>Hi All,
>> >>>>>>
>> >>>>>>Thank you Mohammad for your elaboration on the issues!
>> >>>>>>
>> >>>>>>I have written most of the multi-gem5 patch so let me add some
>>more
>> >>>>>>clarifications and answer to your concerns. My comments are
>>inline
>> >>>>>>below.
>> >>>>>>
>> >>>>>>Thanks,
>> >>>>>>- Gabor
>> >>>>>>
>> >>>>>>On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>> >>>>>>
>> >>>>>>>Hi All,
>> >>>>>>>
>> >>>>>>>Curtis-Thank you for listing some of the differences. I was
>> >>waiting
>> >>>>for
>> >>>>>>>the
>> >>>>>>>completed multi-gem5 patch before I send my review. Please see my
>> >>>>>>inline
>> >>>>>>>response below. I've addressed the concerns that you've raised.
>> >>>>Also,
>> >>>>>>I've
>> >>>>>>>added a bit more to the comparison.
>> >>>>>>>
>> >>>>>>>* Synchronization.
>> >>>>>>>
>> >>>>>>>pd-gem5 implements this in Python (not a problem in itself;
>> >>>>>>aesthetically
>> >>>>>>>
>> >>>>>>>this is nice, but...). The issue is that pd-gem5's data packets
>> >>and
>> >>>>>>>
>> >>>>>>>barrier messages travel over different sockets. Since pd-gem5
>> >>could
>> >>>>>>see
>> >>>>>>>
>> >>>>>>>data packets passing synchronization barriers, it could create an
>> >>>>>>>
>> >>>>>>>inconsistent checkpoint.
>> >>>>>>>
>> >>>>>>>multi-gem5's synchronization is implemented in C++ using sync
>> >>>>events,
>> >>>>>>but
>> >>>>>>>
>> >>>>>>>more importantly, the messages queue up in the same stream and so
>> >>>>>>cannot
>> >>>>>>>
>> >>>>>>>have the issue just described. (Event ordering is often crucial
>> >>in
>> >>>>>>>
>> >>>>>>>snapshot protocols.) Therefore we feel that multi-gem5 is a more
>> >>>>robust
>> >>>>>>>
>> >>>>>>>solution in this respect.
>> >>>>>>>
>> >>>>>>>Each packet in pd-gem5 has a time-stamp. So even if data packets
>> >>>>pass
>> >>>>>>>synchronization barriers (in other words, data packets arrive
>> >>early
>> >>>>at
>> >>>>>>the
>> >>>>>>>destination node), the destination node processes packets based on
>>their
>> >>>>>>>timestamp. Actually allowing data packets to pass sync barriers
>> >>is a
>> >>>>>>nice
>> >>>>>>>feature that can reduce the likelihood of late packet reception.
>> >>>>>>Ordering
>> >>>>>>>of data messages that flow over pd-gem5 nodes is also preserved
>>in
>> >>>>>>pd-gem5
>> >>>>>>>implementation.
>> >>>>>>
>> >>>>>>This seems to be a misunderstanding. Maybe the wording was not
>> >>>>precise
>> >>>>>>before. The problem is not a data packet "passing" a sync
>> >>barrier
>> >>>>>>but the other way around, a sync barrier that can pass a data
>> >>packet
>> >>>>>>(e.g. while the data packet is waiting in the host operating
>>system
>> >>>>>>socket layer). If that happens, the packet will arrive later than
>> >>it
>> >>>>>>was
>> >>>>>>supposed to and it may miss the computed receive tick.
>> >>>>>>
>> >>>>>>For instance, let’s assume that the quantum coincides with the
>> >>>>simulated
>> >>>>>>Ether link delay. (This is the optimal choice of quantum to
>> >>minimize
>> >>>>the
>> >>>>>>number of sync barriers.) If a data packet is sent right at the
>> >>>>>>beginning
>> >>>>>>of a quantum then this packet must arrive at the destination gem5
>> >>>>>>process
>> >>>>>>within the same quantum in order not to miss its receive tick at
>> >>the
>> >>>>>>very
>> >>>>>>beginning of the next quantum. If the sync barrier can pass the
>> >>data
>> >>>>>>packet
>> >>>>>>then the data packet may arrive only during the next quantum (or
>> >>in
>> >>>>>>extreme conditions even later than that) so when it arrives the
>> >>>>receiver
>> >>>>>>gem5 may already have passed the receive tick.
>> >>>>>>
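The late-arrival condition described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the tick values are hypothetical and are not taken from either patch. It assumes the receive tick is simply the send tick plus the simulated link delay, and that the quantum equals that delay.

```python
def receive_tick(send_tick: int, link_delay: int) -> int:
    """Simulated tick at which the destination should see the packet."""
    return send_tick + link_delay


def misses_receive_tick(send_tick: int, link_delay: int,
                        receiver_tick_at_arrival: int) -> bool:
    """True if the receiver has already simulated past the receive tick
    by the time the packet is read from the host socket."""
    return receiver_tick_at_arrival > receive_tick(send_tick, link_delay)


# Hypothetical values: the quantum coincides with the link delay.
QUANTUM = LINK_DELAY = 1000

# Sent at the start of quantum 0 and read while the receiver is still in quantum 0.
on_time = misses_receive_tick(0, LINK_DELAY, 900)
# Same packet, but the sync barrier overtook it and the receiver is in quantum 1.
late = misses_receive_tick(0, LINK_DELAY, 1500)
```

With these numbers the packet must be scheduled at tick 1000, the first tick of the next quantum, so any arrival after the receiver has crossed that barrier is too late.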
>> >>>>>>This argument makes more sense than the previous one. Note that
>> >>gem5
>> >>>>is
>> >>>>>>a
>> >>>>>cycle accurate simulator and it runs orders of magnitude slower
>>than
>> >>>>real
>> >>>>>hardware. So it's almost impossible that the flight time of a packet
>> >>>>through
>> >>>>>the real network turns out to be more than the simulation time of one quantum.
>>We
>> >>>>ran
>> >>>>>a
>> >>>>>set of experiments just for this purpose: with quantum size equal
>>to
>> >>>>>etherlink delay, we never got any late arrival violation (what you
>> >>>>>described) for the full NAS benchmark suite (please refer to the
>>paper).
>> >>>>>
>> >>>>>multi-gem5 is optimized for a case that almost never happens,
>> >>>>>sacrificing speedup for no gain.
>> >>>>>
>> >>>>>
>> >>>>>>Time-stamping does help with this issue. Also, if a data packet is
>> >>>>>>waiting
>> >>>>>>in the host operating system socket layer when the simulation
>> >>thread
>> >>>>>>exits
>> >>>>>>to python to complete the next sync barrier then the packet will
>> >>>>not go
>> >>>>>>into the checkpoint that may follow that sync barrier.
>> >>>>>>
>> >>>>>>That's a good point. Current pd-gem5 checkpointing mechanism might
>> >>>>miss
>> >>>>>packets that have been sent during previous quantum and are waiting
>> >>in
>> >>>>OS
>> >>>>>socket buffer. I should add some code inside ethertap serialization
>> >>>>>function to drain ethertap socket before writing checkpoint. I will
>> >>>>update
>> >>>>>pd-gem5 patch accordingly.
>> >>>>>
>> >>>>>>
>> >>>>>>>What you mentioned as an advantage for multi-gem5 is actually a
>> >>key
>> >>>>>>>disadvantage: buffering sync messages behind data packets can add
>> >>>>to
>> >>>>>>>the
>> >>>>>>>synchronization overhead and slow down simulation significantly.
>> >>>>>>
>> >>>>>>The purpose of sync messages is to make sure that the data packets
>> >>>>>>arrive
>> >>>>>>in time (in terms of simulated time) at the destination so they
>>can
>> >>>>be
>> >>>>>>scheduled for being received at the proper computed tick. Sync
>> >>>>messages
>> >>>>>>also make sure that no data packets are in flight when a sync
>> >>barrier
>> >>>>>>completes before we take a checkpoint. They definitely add
>> >>overhead
>> >>>>for
>> >>>>>>the simulation but they are necessary for the correctness of the
>> >>>>>>simulation.
>> >>>>>>
>> >>>>>>The receive thread in multi-gem5 reads out packets from the socket
>> >>in
>> >>>>>>parallel with the simulation thread so packets normally will not
>>be
>> >>>>>>"queueing up" before a sync barrier message. There is definitely
>> >>>>room
>> >>>>>>for improvements in the current implementation for reducing the
>> >>>>>>synchronization overhead but that is likely true for pd-gem5, too.
>> >>>>>>The important thing here is that the solution must provide
>> >>>>correctness
>> >>>>>>(robustness) first.
>> >>>>>>
>> >>>>>>pd-gem5 provides correctness. Please read my previous comment. The
>> >>>>whole
>> >>>>>purpose of multi/pd-gem5 is to parallelize simulation with minimal
>> >>>>>overhead
>> >>>>>and gain speedup. If you fail to do so, nobody will use your tool.
>> >>>>>
>> >>>>>
>> >>>>>>>Also,
>> >>>>>>>multi-gem5 sends huge messages (multiHeaderPkt) through the
>> >>>>network to
>> >>>>>>>perform each synchronization point, which increases
>> >>synchronization
>> >>>>>>>overhead further. In pd-gem5, we choose to send just one
>>character
>> >>>>as
>> >>>>>>sync
>> >>>>>>>message through a separate socket to reduce synchronization
>> >>>>overhead.
>> >>>>>>
>> >>>>>>The TCP/IP message size is unlikely the bottleneck here.
>>Multi-gem5
>> >>>>will
>> >>>>>>send ~50 bytes more in a sync barrier message than pd-gem5 but
>>that
>> >>>>>>bigger
>> >>>>>>sync message still fits into a single ethernet frame on the wire.
>> >>The
>> >>>>>>end-to-end latency overhead that is caused by 50 bytes extra
>> >>payload
>> >>>>for
>> >>>>>>a small single frame TCP/IP message is likely to fall into the
>> >>>>"noise"
>> >>>>>>category if one tries to measure it in a real cluster.
>> >>>>>>
>> >>>>>>You should prove your hypothesis experimentally. Each gem5 process
>> >>>>>sends/receives sync messages at the end of every quantum. Say you are
>> >>>>>simulating "N" node computer cluster with "M" different
>> >>configuration.
>> >>>>>Then
>> >>>>>you will have N*M gem5 processes that send/receive these 50 Bytes
>>(I
>> >>>>>think
>> >>>>>it's more) extra data at the same time over network ...
>> >>>>>
>> >>>>>Furthermore, multi-gem5 sends a header before each data message.
>> >>>>In comparison,
>> >>>>>pd-gem5 just adds 12 Bytes (each time-stamp is the 12
>>least
>> >>>>>significant digits of the Tick) to each data packet. I don't know
>> >>>>exactly
>> >>>>>how large these "MultiHeaderPkt" are, but it just has two Tick
>>fields
>> >>>>that
>> >>>>>are each 64 Bytes! Also, header packets are separate TCP packets, so
>> >>you
>> >>>>>pay
>> >>>>>for sending two separate packets for each data packet. And worse,
>>you
>> >>>>>serialize all of these with sync messages.
>> >>>>>
>> >>>>>
>> >>>>>>>
>> >>>>>>>* Packet handling.
>> >>>>>>>
>> >>>>>>>pd-gem5 uses EtherTap for data packets but changed the polling
>> >>>>>>mechanism
>> >>>>>>>
>> >>>>>>>to go through the main event queue. Since this rate is actually
>> >>>>linked
>> >>>>>>>
>> >>>>>>>with simulator progress, it cannot guarantee that the packets are
>> >>>>>>>serviced
>> >>>>>>>
>> >>>>>>>at regular intervals of real time. This can lead to packets
>> >>>>queueing
>> >>>>>>up
>> >>>>>>>
>> >>>>>>>which would contribute to the synchronization issues mentioned
>> >>>>above.
>> >>>>>>>
>> >>>>>>>multi-gem5 uses plain sockets with separate receive threads and
>>so
>> >>>>does
>> >>>>>>>not
>> >>>>>>>
>> >>>>>>>have this issue.
>> >>>>>>>
>> >>>>>>>I think again you are pointing to your first concern that I've
>> >>>>>>explained
>> >>>>>>>above. Packets that have queued up in EtherTap socket, will be
>> >>>>>>processed
>> >>>>>>>and delivered to simulation environment at the beginning of next
>> >>>>>>>simulation
>> >>>>>>>quantum.
>> >>>>>>>
>> >>>>>>>Please notice that multi-gem5 introduces a new simObject to
>> >>>>interface
>> >>>>>>>the simulation environment to the real world, which is redundant. This
>> >>>>>>>functionality
>> >>>>>>>is already provided by EtherTap.
>> >>>>>>
>> >>>>>>Except that the EtherTap solution does not provide a correct
>> >>(robust)
>> >>>>>>solution for the synchronization problem.
>> >>>>>>
>> >>>>>>Please read my first/second comments.
>> >>>>>
>> >>>>>
>> >>>>>>>
>> >>>>>>>* Checkpoint accuracy.
>> >>>>>>>
>> >>>>>>>A user would like to have a checkpoint at precisely the time the
>> >>>>>>>
>> >>>>>>>'m5 checkpoint' operation is executed so as to not miss any of
>>the
>> >>>>>>>
>> >>>>>>>area of interest in his application.
>> >>>>>>>
>> >>>>>>>pd-gem5 requires that simulation finish the current quantum
>> >>>>>>>
>> >>>>>>>before checkpointing, so it cannot provide this.
>> >>>>>>>
>> >>>>>>>(Shortening the quantum can help, but usually the snapshot is
>> >>being
>> >>>>>>taken
>> >>>>>>>
>> >>>>>>>while 'fast-forwarding', i.e. simulating as fast as possible,
>> >>which
>> >>>>>>would
>> >>>>>>>
>> >>>>>>>motivate a longer quantum.)
>> >>>>>>>
>> >>>>>>>multi-gem5 can enter the drain cycle immediately upon receiving a
>> >>>>>>>
>> >>>>>>>checkpoint request. We find this accuracy highly desirable.
>> >>>>>>>
>> >>>>>>>It's true that if you have a large quantum size then there would
>> >>be
>> >>>>>>some
>> >>>>>>>discrepancy between the m5_ckpt instruction tick and the actual
>> >>dump
>> >>>>>>tick.
>> >>>>>>>Based on the multi-gem5 code, my understanding is that you send an async
>> >>>>>>>checkpoint message as soon as one of the gem5 processes encounters an
>> >>>>>>m5_ckpt
>> >>>>>>>instruction. But I'm not sure how you fix the aforementioned
>> >>issue,
>> >>>>>>>because
>> >>>>>>>you have to sync all gem5 processes before you start dumping the
>> >>>>>>checkpoint,
>> >>>>>>>which necessitates a global synchronization beforehand.
>> >>>>>>
>> >>>>>>In multi-gem5, the gem5 process that encounters the m5_ckpt
>> >>>>instruction
>> >>>>>>sends out an async checkpoint notification for the peer gem5
>> >>>>processes
>> >>>>>>and
>> >>>>>>then it starts the draining immediately (at the same tick). So
>>the
>> >>>>>>checkpoint will be taken at the exact tick from the initiator
>> >>process
>> >>>>>>point of view. The global synchronisation with the peer processes
>> >>>>takes
>> >>>>>>place while the initiator process is still waiting at the same
>>tick
>> >>>>(i.e.
>> >>>>>>the simulation thread is suspended). However, the receiver thread
>> >>>>>>continues reading out the socket - while waiting for the global
>> >>sync
>> >>>>to
>> >>>>>>complete - to make sure that in-flight data packets from peer gem5
>> >>>>>>processes
>> >>>>>>are stored properly and saved into the checkpoint.
>> >>>>>>
>> >>>>>>
>> >>>>>So you mean multi-gem5 ends up with having gem5 processes with
>> >>>>different
>> >>>>>ticks after checkpoint? In pd-gem5 we make sure that all gem5
>> >>processes
>> >>>>>start dumping checkpoint at the same tick. Are you sure that this
>>is
>> >>>>>correct to have each gem5 process dump checkpoint at different
>> >>ticks???
>> >>>>>
>> >>>>>I don't think this is a correct checkpointing design. However, if you
>> >>>>feel it
>> >>>>>is correct, I can change a couple of lines in "Simulation.py" and
>> >>>>barrier
>> >>>>>scripts to implement the same functionality in pd-gem5. One thing
>> >>that
>> >>>>you
>> >>>>>are obsessed about is making sure that there are no in-flight
>>packets
>> >>>>>when
>> >>>>>we start dumping a checkpoint, and you have all these complex
>> >>mechanisms
>> >>>>in
>> >>>>>place to ensure that! I think you can be 99.99999% sure that
>>there
>> >>>>is no
>> >>>>>in-flight packet by waiting for 1 second after all gem5 processes
>> >>>>finished
>> >>>>>their quantum simulation and then dump checkpoint. Do you really
>> >>think
>> >>>>>that
>> >>>>>delivering a tcp packet would take more than 1 second in today's
>> >>>>systems!?
>> >>>>>Always go for simple solutions ...
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>>>>
>> >>>>>>>By the way, we have a fix for this issue by introducing a new m5
>> >>>>pseudo
>> >>>>>>>instruction.
>> >>>>>>
>> >>>>>>I fail to see how a new pseudo instruction can solve the problem
>>of
>> >>>>>>completing the full quantum in pd-gem5 before a checkpoint can be
>> >>>>taken.
>> >>>>>>Could you please elaborate on that?
>> >>>>>>
>> >>>>>>As we take checkpoint while fast-forwarding and it is likely that
>> >>we
>> >>>>>>relax
>> >>>>>synchronization for speedup purpose, a new pseudo instruction that
>> >>can
>> >>>>set
>> >>>>>quantum size (m5_qset) can be helpful. So, one can insert m5_qset
>>in
>> >>>>his
>> >>>>>benchmark source code before entering ROI that contains m5_ckpt to
>> >>>>>decrease
>> >>>>>quantum size beforehand and reduce the discrepancy between m5_ckpt
>> >>tick
>> >>>>>and
>> >>>>>actual checkpoint tick. This is not included in pd-gem5 patch right
>> >>>>now.
>> >>>>>
>> >>>>>
>> >>>>>>>
>> >>>>>>>* Implementation of network topology.
>> >>>>>>>
>> >>>>>>>pd-gem5 uses a separate gem5 process to act as a switch whereas
>> >>>>>>multi-gem5
>> >>>>>>>
>> >>>>>>>uses a standalone packet relay process.
>> >>>>>>>
>> >>>>>>>We haven't measured the overhead of pd-gem5's simulated switch
>> >>yet,
>> >>>>but
>> >>>>>>>
>> >>>>>>>we're confident that our approach is at least as fast and more
>> >>>>>>scalable.
>> >>>>>>>
>> >>>>>>>There is this flexibility in pd-gem5 to simulate a switch box
>> >>>>alongside
>> >>>>>>>one
>> >>>>>>>of the other gem5 processes. However, it might make that gem5
>> >>>>process
>> >>>>>>the
>> >>>>>>>simulation bottleneck. One of the advantages of pd-gem5 over
>> >>>>>>multi-gem5 is
>> >>>>>>>that we use gem5 to simulate a switch box, which allows us to
>> >>model
>> >>>>any
>> >>>>>>>network topology by instantiating several Switch simObjects and
>> >>>>>>>interconnecting them with EtherLink in an arbitrary fashion. A
>> >>>>standalone
>> >>>>>>tcp
>> >>>>>>>server can only provide switch functionality (forwarding packets
>> >>to
>> >>>>>>>destinations) and model a star network topology. Furthermore, it
>> >>>>cannot
>> >>>>>>>model various network timings such as queueing delay, congestion,
>> >>>>and
>> >>>>>>>routing latency. Also it has some accuracy issues that I will
>> >>point
>> >>>>out
>> >>>>>>>next.
>> >>>>>>
>> >>>>>>I agree with the complex topology argument. We already mentioned
>> >>that
>> >>>>>>before as an advantage for pd-gem5 from the point of view of
>>future
>> >>>>>>extensions. However, I do not agree that multi-gem5 cannot model
>> >>>>>>queueing
>> >>>>>>delays and congestions. For a simple crossbar switch, it can model
>> >>>>>>queueing
>> >>>>>>delays and congestions, but the receive queues are distributed
>> >>among
>> >>>>the
>> >>>>>>gem5 processes.
>> >>>>>>
>> >>>>>>It's true that you can model queuing delay of a simple crossbar by
>> >>>>>distributing queues across gem5 processes (end points). But to be
>> >>able
>> >>>>to
>> >>>>>do so you have to ensure the ordering of packets that you enqueue
>>in
>> >>>>the
>> >>>>>distributed queues. It is almost impossible without a synchronized
>> >>>>switch
>> >>>>>box. You should have a reorder queue that reorders packets
>> >>dynamically
>> >>>>and
>> >>>>>updates the timing parameters for each packet as well. I don't know how
>> >>much
>> >>>>>progress you have made on an ordering scheme in multi-gem5, but
>>you
>> >>>>may
>> >>>>>have already realized how complex and error prone it can be. This
>> >>>>argument
>> >>>>>is also related to my next argument for "Broken network timing".
>> >>>>>
>> >>>>>
>> >>>>>>>
>> >>>>>>>* Broken network timing:
>> >>>>>>>
>> >>>>>>>Forwarding packets between gem5 processes using a standalone tcp
>> >>>>server
>> >>>>>>>can
>> >>>>>>>cause reordering between packets that have different source but
>> >>same
>> >>>>>>>destination. It causes inaccurate network timing and, worst of
>>all,
>> >>>>>>>non-deterministic simulation. pd-gem5 resolves this by reordering
>> >>>>>>packets
>> >>>>>>>at
>> >>>>>>>the Switch process and then sending them to their destination (it's
>> >>>>possible
>> >>>>>>as
>> >>>>>>>switch is synchronized with the rest of the nodes).
>> >>>>>>
>> >>>>>>In multi-gem5, there is always a HeaderPkt that contains some meta
>> >>>>>>information for each data packet. The meta information includes the
>> >>>>send
>> >>>>>>tick and the sender rank (i.e. a unique ID of the sender gem5
>> >>>>process).
>> >>>>>>We use that information to define a well-defined ordering of
>> >>packets
>> >>>>>>even
>> >>>>>>if packets are arriving at the same receiver from different
>> >>senders.
>> >>>>>>This
>> >>>>>>packet ordering scheme is still being tested so the corresponding
>> >>>>patch
>> >>>>>>is
>> >>>>>>not on the RB yet.
>> >>>>>>
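An ordering keyed on send tick and sender rank, as described above, could look like the sketch below. The ordering patch was not on the review board at the time of this message, so this is an assumption about the general idea and not the multi-gem5 implementation; the Header type and its field names are hypothetical.

```python
from typing import List, NamedTuple


class Header(NamedTuple):
    """Hypothetical per-packet meta information."""
    send_tick: int    # simulated tick at which the sender emitted the packet
    sender_rank: int  # unique ID of the sending gem5 process
    payload: bytes


def delivery_order(packets: List[Header]) -> List[Header]:
    """Order packets by send tick, breaking ties by sender rank, so the
    result does not depend on the order in which the host delivered them."""
    return sorted(packets, key=lambda p: (p.send_tick, p.sender_rank))


a = Header(send_tick=100, sender_rank=2, payload=b"a")
b = Header(send_tick=100, sender_rank=1, payload=b"b")
c = Header(send_tick=50, sender_rank=3, payload=b"c")

# Two different host arrival orders yield the same simulated delivery order.
first_run = delivery_order([a, b, c])
second_run = delivery_order([c, a, b])
```

Because the key uses only simulated quantities, two runs that see different host-side arrival orders still deliver packets in the same sequence, which is the determinism property the thread is debating.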
>> >>>>>>Please read my previous comment. The most important part of
>> >>>>>>multi/pd-gem5
>> >>>>>extension is ensuring accurate and deterministic simulation.
>> >>>>>
>> >>>>>
>> >>>>>>>
>> >>>>>>>* Amount of changes
>> >>>>>>>
>> >>>>>>>pd-gem5 introduces different modes in etherlink just to provide
>> >>>>accurate
>> >>>>>>>timing for each component in the network subsystem (NIC, link,
>> >>>>switch)
>> >>>>>>as
>> >>>>>>>well as capability of modeling different network topologies
>>(mesh,
>> >>>>>>ring,
>> >>>>>>>fat tree, etc). To enable a simple functionality, like what
>> >>>>multi-gem5
>> >>>>>>>provides, the amount of changes in gem5 can be limited to
>> >>>>time-stamping
>> >>>>>>>packets and providing synchronization through python scripts.
>> >>>>However,
>> >>>>>>>multi-gem5 re-implements functionality that is already in gem5.
>> >>>>>>
>> >>>>>>This argument holds only if both implementations are correct
>> >>>>(robust).
>> >>>>>>It
>> >>>>>>still seems to me that pd-gem5 does not provide correctness for
>>the
>> >>>>>>synchronization/checkpointing parts.
>> >>>>>>
>> >>>>>>Again, please read my first comment for correctness of pd-gem5.
>> >>>>>
>> >>>>>
>> >>>>>>>
>> >>>>>>>* Integrating with gem5 mainstream:
>> >>>>>>>
>> >>>>>>>pd-gem5 launch script is written in python which is suited for
>> >>>>>>integration
>> >>>>>>>with gem5 python scripts. However multi-gem5 uses bash script.
>> >>Also,
>> >>>>>>all
>> >>>>>>>source files in pd-gem5 are already parts of gem5 mainstream.
>> >>>>However
>> >>>>>>>multi-gem5 has tcp_server.cc/hh that is a standalone process and
>> >>>>cannot
>> >>>>>>be
>> >>>>>>>part of gem5.
>> >>>>>>
>> >>>>>>The multi-gem5 launch script is simple enough to rely only on the
>> >>>>>>shell. It
>> >>>>>>can obviously be easily re-written in python if that added any
>> >>value.
>> >>>>>>The
>> >>>>>>tcp_server component is only a utility (like the "m5" utility that
>> >>is
>> >>>>>>also
>> >>>>>>part of gem5).
>> >>>>>>
>> >>>>>>The thing is that it's more likely that users want to add some
>> >>>>>functionality to the run-script of multi/pd-gem5. E.g. pd-gem5
>> >>>>run-script
>> >>>>>supports launching simulations using a simulation pool management
>> >>>>>software (
>> >>>>>http://research.cs.wisc.edu/htcondor/). Using python enables users
>>to
>> >>>>>easily add this kind of support.
>> >>>>>
>> >>>>>
>> >>>>>>
>> >>>>>>Cheers,
>> >>>>>>- Gabor
>> >>>>>>
>> >>>>>>
>> >>>>>>>On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham
>> >>>><***@arm.com>
>> >>>>>>>wrote:
>> >>>>>>>
>> >>>>>>>>Hello everyone,
>> >>>>>>>>We have taken a look at how pd-gem5 compares with multi-gem5.
>> >>>>While
>> >>>>>>>>intending
>> >>>>>>>>to deliver the same functionality, there are some crucial
>> >>>>differences:
>> >>>>>>>>
>> >>>>>>>>* Synchronization.
>> >>>>>>>>
>> >>>>>>>> pd-gem5 implements this in Python (not a problem in itself;
>> >>>>>>>>aesthetically
>> >>>>>>>> this is nice, but...). The issue is that pd-gem5's data
>> >>>>packets
>> >>>>>>and
>> >>>>>>>> barrier messages travel over different sockets. Since
>> >>pd-gem5
>> >>>>>>could
>> >>>>>>>>see
>> >>>>>>>> data packets passing synchronization barriers, it could
>> >>create
>> >>>>an
>> >>>>>>>> inconsistent checkpoint.
>> >>>>>>>>
>> >>>>>>>> multi-gem5's synchronization is implemented in C++ using
>>sync
>> >>>>>>events,
>> >>>>>>>>but
>> >>>>>>>> more importantly, the messages queue up in the same stream
>> >>and
>> >>>>so
>> >>>>>>>>cannot
>> >>>>>>>> have the issue just described. (Event ordering is often
>> >>>>crucial
>> >>>>>>in
>> >>>>>>>> snapshot protocols.) Therefore we feel that multi-gem5 is a
>> >>>>more
>> >>>>>>>>robust
>> >>>>>>>> solution in this respect.
>> >>>>>>>>
>> >>>>>>>>* Packet handling.
>> >>>>>>>>
>> >>>>>>>> pd-gem5 uses EtherTap for data packets but changed the
>> >>polling
>> >>>>>>>>mechanism
>> >>>>>>>> to go through the main event queue. Since this rate is
>> >>>>actually
>> >>>>>>>>linked
>> >>>>>>>> with simulator progress, it cannot guarantee that the
>>packets
>> >>>>are
>> >>>>>>>>serviced
>> >>>>>>>> at regular intervals of real time. This can lead to packets
>> >>>>>>>>queueing up
>> >>>>>>>> which would contribute to the synchronization issues
>> >>mentioned
>> >>>>>>above.
>> >>>>>>>>
>> >>>>>>>> multi-gem5 uses plain sockets with separate receive threads
>> >>>>and so
>> >>>>>>>>does
>> >>>>>>>>not
>> >>>>>>>> have this issue.
>> >>>>>>>>
>> >>>>>>>>* Checkpoint accuracy.
>> >>>>>>>>
>> >>>>>>>> A user would like to have a checkpoint at precisely the time
>> >>the
>> >>>>>>>> 'm5 checkpoint' operation is executed so as to not miss any
>>of
>> >>>>the
>> >>>>>>>> area of interest in his application.
>> >>>>>>>>
>> >>>>>>>> pd-gem5 requires that simulation finish the current quantum
>> >>>>>>>> before checkpointing, so it cannot provide this.
>> >>>>>>>>
>> >>>>>>>> (Shortening the quantum can help, but usually the snapshot is
>> >>>>being
>> >>>>>>>>taken
>> >>>>>>>> while 'fast-forwarding', i.e. simulating as fast as possible,
>> >>>>which
>> >>>>>>>>would
>> >>>>>>>> motivate a longer quantum.)
>> >>>>>>>>
>> >>>>>>>> multi-gem5 can enter the drain cycle immediately upon
>> >>receiving
>> >>>>a
>> >>>>>>>> checkpoint request. We find this accuracy highly desirable.
>> >>>>>>>>
>> >>>>>>>>* Implementation of network topology.
>> >>>>>>>>
>> >>>>>>>> pd-gem5 uses a separate gem5 process to act as a switch
>> >>whereas
>> >>>>>>>>multi-gem5
>> >>>>>>>> uses a standalone packet relay process.
>> >>>>>>>>
>> >>>>>>>> We haven't measured the overhead of pd-gem5's simulated
>>switch
>> >>>>yet,
>> >>>>>>>>but
>> >>>>>>>> we're confident that our approach is at least as fast and
>>more
>> >>>>>>>>scalable.
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>>Thanks,
>> >>>>>>>>Curtis
>> >>>>>>>>________________________________________
>> >>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of Mohammad
>> >>>>>>Alian [
>> >>>>>>>>***@wisc.edu]
>> >>>>>>>>Sent: Friday, June 26, 2015 7:37 PM
>> >>>>>>>>To: gem5 Developer List
>> >>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a
>> >>parallel/distributed
>> >>>>>>>>system
>> >>>>>>>>on multiple physical hosts
>> >>>>>>>>
>> >>>>>>>>Hi Anthony,
>> >>>>>>>>
>> >>>>>>>>I think that would be a good option, then I can add pd-gem5
>> >>>>>>>>functionality
>> >>>>>>>>on top of that. Right now I've simplified your implementation.
>> >>>>Also, I
>> >>>>>>>>think I had found some bugs in your patch that I cannot remember
>> >>>>now.
>> >>>>>>If
>> >>>>>>>>you decide to ship the EtherSwitch patch, let me know so I can
>> >>>>>>review
>> >>>>>>>>it.
>> >>>>>>>>
>> >>>>>>>>Thanks,
>> >>>>>>>>Mohammad
>> >>>>>>>>
>> >>>>>>>>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
>> >>>>>>>>***@amd.com> wrote:
>> >>>>>>>>
>> >>>>>>>>>Would it make sense for me to ship the EtherSwitch patch first,
>> >>>>since
>> >>>>>>>>it
>> >>>>>>>>>has utility on its own, and then we can decide which of the
>> >>>>>>>>"multi-gem5"
>> >>>>>>>>>approaches is best, or if it's some combination of both?
>> >>>>>>>>>
>> >>>>>>>>>The only reason I never shipped it was because Steve raised an
>> >>>>issue
>> >>>>>>>>that
>> >>>>>>>>>I didn't have a good alternative for, and didn't have the time
>> >>to
>> >>>>>>look
>> >>>>>>>>into
>> >>>>>>>>>one at that time.
>> >>>>>>>>>________________________________________
>> >>>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of
>>Mohammad
>> >>>>>>>>Alian [
>> >>>>>>>>>***@wisc.edu]
>> >>>>>>>>>Sent: Wednesday, June 24, 2015 12:43 PM
>> >>>>>>>>>To: gem5 Developer List
>> >>>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a
>> >>parallel/distributed
>> >>>>>>>>system
>> >>>>>>>>>on multiple physical hosts
>> >>>>>>>>>
>> >>>>>>>>>Hi Andreas,
>> >>>>>>>>>
>> >>>>>>>>>Thanks for the comment.
>> >>>>>>>>>I think the checkpointing support in both works is the same.
>> >>Here
>> >>>>is
>> >>>>>>>>how
>> >>>>>>>>>checkpointing support is implemented in pd-gem5:
>> >>>>>>>>>
>> >>>>>>>>>Whenever one of the gem5 processes encounters an m5-checkpoint
>>pseudo
>> >>>>>>>>>instruction, it will send a "recv-ckpt" signal to the
>> >>>>>>>>>"barrier" process. Then the "barrier" process sends a
>> >>"take-ckpt"
>> >>>>>>>>signal
>> >>>>>>>>to
>> >>>>>>>>>all the simulated nodes
>> >>>>>>>>>(including the node that encountered m5-checkpoint) at the end
>> >>of
>> >>>>the
>> >>>>>>>>>current simulation quantum. On reception of the
>> >>>>>>>>>"take-ckpt" signal, gem5 processes start dumping checkpoints.
>> >>>>This
>> >>>>>>>>makes
>> >>>>>>>>>each simulated node dump a checkpoint
>> >>>>>>>>>at the same simulated time point while ensuring there are no
>> >>>>in-flight
>> >>>>>>>>>packets.
>> >>>>>>>>>
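The signalling described above can be sketched as a small state machine on the barrier side. This is a minimal model and not the pd-gem5 barrier script: only the "recv-ckpt" and "take-ckpt" names come from the message, while the "quantum-done" and "continue" messages and the Barrier class are hypothetical, and the socket layer is left out entirely.

```python
from typing import Dict, Set


class Barrier:
    """Collects end-of-quantum arrivals and releases all nodes together."""

    def __init__(self, num_nodes: int) -> None:
        self.num_nodes = num_nodes
        self.ckpt_requested = False
        self.arrived: Set[int] = set()

    def on_message(self, node_id: int, msg: str) -> Dict[int, str]:
        """Return the replies to send, keyed by node ID (empty if none yet)."""
        if msg == "recv-ckpt":
            # A node hit m5-checkpoint; remember it until the quantum ends.
            self.ckpt_requested = True
            return {}
        if msg == "quantum-done":  # hypothetical end-of-quantum message
            self.arrived.add(node_id)
            if len(self.arrived) < self.num_nodes:
                return {}
            # Every node reached the barrier: release them all at once.
            reply = "take-ckpt" if self.ckpt_requested else "continue"
            self.arrived.clear()
            self.ckpt_requested = False
            return {n: reply for n in range(self.num_nodes)}
        raise ValueError(f"unknown message: {msg}")


barrier = Barrier(num_nodes=3)
barrier.on_message(1, "recv-ckpt")      # node 1 executed m5-checkpoint
barrier.on_message(0, "quantum-done")
barrier.on_message(1, "quantum-done")
replies = barrier.on_message(2, "quantum-done")
```

In this model the checkpoint request is held until every node has finished the current quantum, which is why all nodes dump at the same simulated time and also why the checkpoint tick can lag behind the tick of the m5-checkpoint instruction.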
>> >>>>>>>>>I believe this is the same as multi-gem5 patch approach for
>> >>>>>>checkpoint
>> >>>>>>>>>support (based on the commit message of
>> >>>>>>>>http://reviews.gem5.org/r/2865/
>> >>>>>>>>).
>> >>>>>>>>>Also, we have tested our mechanism with several benchmarks and
>> >>it
>> >>>>>>>>works.
>> >>>>>>>>As
>> >>>>>>>>>Steve suggested, I'll look into Curtis's patch and try to
>>review
>> >>>>it
>> >>>>>>as
>> >>>>>>>>>well.
>> >>>>>>>>>But as Nilay also mentioned earlier, there is some code
>> >>missing
>> >>>>in
>> >>>>>>>>>Curtis's patch. I prefer to first run multi-gem5 before
>>starting
>> >>>>to
>> >>>>>>>>review
>> >>>>>>>>>it.
>> >>>>>>>>>
>> >>>>>>>>>Thank you,
>> >>>>>>>>>Mohammad
>> >>>>>>>>>
>> >>>>>>>>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
>> >>>>>>>>***@arm.com>
>> >>>>>>>>>wrote:
>> >>>>>>>>>
>> >>>>>>>>>>Hi Steve,
>> >>>>>>>>>>
>> >>>>>>>>>>Apologies for the confusion. We are on the same page. My point
>> >>is
>> >>>>>>>>that
>> >>>>>>>>we
>> >>>>>>>>>>cannot simply take a little bit of patch A and a little bit of
>> >>>>>>>>patch B.
>> >>>>>>>>>>This change involves a lot of code, and we need to approach
>> >>this
>> >>>>in
>> >>>>>>>>a
>> >>>>>>>>>>structured fashion. My proposal is to do it bottom up, and
>> >>start
>> >>>>by
>> >>>>>>>>>>getting the basic support in place. Since
>> >>>>>>>>>http://reviews.gem5.org/r/2826/
>> >>>>>>>>>>has already been on the review board for a few months, I am
>> >>>>merely
>> >>>>>>>>>>suggesting that it would be a good start to relate the
>> >>newly
>> >>>>>>>>posted
>> >>>>>>>>>>patches to what is already there.
>> >>>>>>>>>>
>> >>>>>>>>>>Andreas
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
>> >>>>>>>>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com>
>> >>wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>>Hi Andreas,
>> >>>>>>>>>>>
>> >>>>>>>>>>>I'm a little confused by your email---you say you're
>> >>>>fundamentally
>> >>>>>>>>>opposed
>> >>>>>>>>>>>to looking at both patches and picking the best features,
>>then
>> >>>>you
>> >>>>>>>>point
>> >>>>>>>>>>>out that the patches Curtis posted have the feature of better
>> >>>>>>>>>>>checkpointing
>> >>>>>>>>>>>support so we should pick that :).
>> >>>>>>>>>>>
>> >>>>>>>>>>>Obviously we can't just pick patch A from Mohammad's set and
>> >>>>>>>>>>>patch B from Curtis's set and expect them to work together, but
>> >>>>>>>>>>>I think that having both sets of patches available and comparing
>> >>>>>>>>>>>and contrasting the two implementations should enable us to get
>> >>>>>>>>>>>to a single implementation that's the best of both. Someone will
>> >>>>>>>>>>>have to make the effort of integrating the better ideas from one
>> >>>>>>>>>>>set into the other set to create a new unified set of patches
>> >>>>>>>>>>>(or maybe we commit one set and then integrate the best of the
>> >>>>>>>>>>>other set as patches on top of that), but the first step is to
>> >>>>>>>>>>>identify what "the best of both" is. Having Mohammad look at
>> >>>>>>>>>>>Curtis's patches, and Curtis (or someone else from ARM) closely
>> >>>>>>>>>>>examine Mohammad's patches would be a great start. I intend to
>> >>>>>>>>>>>review them both, though unfortunately my time has been scarce
>> >>>>>>>>>>>lately---I'm hoping to squeeze that in later this week.
>> >>>>>>>>>>>
>> >>>>>>>>>>>Once we've had a few people look at both, we can discuss the
>> >>>>>>>>>>>pros and cons of each, then discuss the strategy for getting the
>> >>>>>>>>>>>best features in. So far I've heard that Mohammad's patches have
>> >>>>>>>>>>>a better network model but the ARM patches have better
>> >>>>>>>>>>>checkpointing support; that seems like a good start.
>> >>>>>>>>>>>
>> >>>>>>>>>>>Steve
>> >>>>>>>>>>>
>> >>>>>>>>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <***@arm.com>
>> >>>>>>>>>>>wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>>>Hi all,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Great work. However, I fundamentally do not believe in the
>> >>>>>>>>>>>>approach of 'letting reviewers pick the best features'. There
>> >>>>>>>>>>>>is no way we would ever get something working out of it. We
>> >>>>>>>>>>>>need to get _one_ working solution here, and figure out how to
>> >>>>>>>>>>>>best get there. I would propose to do it bottom up, starting
>> >>>>>>>>>>>>with the basic multi-simulator instance support, checkpointing
>> >>>>>>>>>>>>support, and then move on to the network between the simulator
>> >>>>>>>>>>>>instances.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Thus, I propose we go with the low-level plumbing and
>> >>>>>>>>>>>>checkpoint support from what Curtis has posted. I believe
>> >>>>>>>>>>>>proper checkpointing support to be the most challenging, and
>> >>>>>>>>>>>>from what I can tell this is far more limited in what you just
>> >>>>>>>>>>>>posted, Mohammad. Could you perhaps review Curtis's patches
>> >>>>>>>>>>>>based on your insights, so that we can try to get these
>> >>>>>>>>>>>>patches in shape and committed asap?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Once we have the baseline functionality in place, then we can
>> >>>>>>>>>>>>start looking at the more elaborate network models.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Does this sound reasonable?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Thanks,
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Andreas
>> >>>>>>>>>>>>


Mohammad Alian
2015-07-07 15:11:38 UTC
Permalink
Then you are assuming that the checkpoint is taken with a quantum size smaller
than the link latency, which contradicts your initial motivation for unsync
checkpoints!
(I copied this sentence from earlier messages in the thread as a reminder:)
"Shortening the quantum can help, but usually the snapshot is being taken
while 'fast-forwarding', i.e. simulating as fast as possible, which would
motivate a longer quantum."

What if somebody wants to relax synchronization and take a checkpoint?
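
The constraint being argued over can be made concrete with a small sketch. It
assumes the usual conservative scheme, in which all processes synchronize at
multiples of a fixed quantum and a packet sent at tick t is received at
t + link_latency. The function names are made up for illustration and are not
taken from either patch set.

```python
def quantum_end(tick, quantum):
    """Tick of the periodic sync that ends the quantum containing `tick`."""
    return (tick // quantum + 1) * quantum


def receive_lands_after_sync(send_tick, link_latency, quantum):
    """True if the packet is received in a later quantum than it was sent in."""
    return send_tick + link_latency >= quantum_end(send_tick, quantum)


def always_safe(link_latency, quantum):
    """Check every send offset within one quantum (the pattern repeats)."""
    return all(receive_lands_after_sync(t, link_latency, quantum)
               for t in range(quantum))
```

With these definitions, always_safe(100, 100) holds, while always_safe(100, 400)
does not: once the quantum is relaxed beyond the link latency, a packet can be
due inside the quantum in which it was sent.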

On Tue, Jul 7, 2015 at 7:38 AM, Gabor Dozsa <***@arm.com> wrote:

>
> Hi Mohammad and all,
>
> gem5 processes may restore at different ticks from a checkpoint, but the
> next periodic sync will happen at the same tick in all gem5 processes. The
> receive tick of a packet cannot fall into the current quantum, so every
> packet can get scheduled for receive properly even if a checkpoint/restore
> happens during a quantum.
>
> Regarding your multi-threaded dual config, my understanding is that
> EtherLink is not prepared to work with multi-threading, as it lacks thread
> safety. The multiple event queues/threads config only works if the systems
> are independent.
>
> One possible way to fix that is to provide a "multi-thread" based
> implementation for MultiIface ;-)
>
> - Gabor
>
> On 7/7/15, 6:29 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>
> >Gabor- My concern about unsync checkpoints is that when you restore from
> >an unsync checkpoint, you'll have gem5 processes that are each running at
> >a different tick. Then how do you handle accurate delivery of packets
> >between these gem5 processes? It will also make it harder to integrate
> >multi/pd-gem5 with the current multi-threaded gem5. The problem with sync
> >checkpoints is that you cannot take the checkpoint exactly at the ROI, but
> >I think unsync checkpoints introduce some other problems. Considering the
> >necessary warmup period before starting stat collection, I think we don't
> >need to pinpoint the ROI exactly. Please correct me if I'm wrong.
> >
> >I'm trying to run a multi-threaded experiment with pd-gem5, but I got an
> >error when I tried to partition a dual-mode simulation across two threads.
> >I posted that on the gem5 users mailing list. Please help me with that if
> >you can.
> >
> >Thank you,
> >Mohammad
> >
> >On Mon, Jul 6, 2015 at 11:45 AM, Gabor Dozsa <***@arm.com> wrote:
> >
> >> Thank you Steve for the detailed elaboration on the issues.
> >>
> >>
> >> Regarding the “unsynchronized checkpoints”, the terminology might be a
> >> bit confusing. In fact, we always need to do a global synchronization
> >> among the gem5 processes before taking a distributed checkpoint (in order
> >> to avoid in-flight packets). The global synchronization here means that
> >> each gem5 has to suspend the simulation and wait until every in-flight
> >> packet arrives (and is stored) at the destination gem5 process. If that
> >> global synchronization step happens at the same simulated tick in each
> >> gem5, then we call the checkpoint “synchronous”; otherwise it is an
> >> “asynchronous” checkpoint.
> >>
> >> In the MPI application example I mentioned before, the checkpoint should
> >> be triggered as soon as the “slowest” MPI process reaches the
> >> MPI_Barrier(). The problem is that the “slowest” MPI process usually does
> >> not reach the MPI_Barrier() right at the end of the current quantum. If
> >> we let the simulation continue until the quantum completes (to ensure
> >> that the checkpoint is taken at the same simulated tick in each gem5),
> >> then the MPI processes will already have completed the MPI_Barrier() and
> >> started executing the ROI code.
> >>
> >> Regarding the integration of multi-threaded/multi-host simulation,
> >> multi-gem5 does not currently support fine-grain simulation of
> >> hierarchical switches (or any other network topologies except a single
> >> crossbar) or multiple synchronization domains.
> >>
> >> However, I'm a bit confused about your statement that you don’t see
> >> value in ever building a shared-memory transport for MultiIface.
> >> MultiIface in my view is just an abstract interface for
> >> “multi-(ether)-link” objects, which are link objects for connecting
> >> multiple (i.e. more than two) systems. It aims to encapsulate the API
> >> necessary for any Link object in any multi-system configuration -
> >> provided that we partition the systems across network links during run
> >> time.
> >>
> >> An orthogonal issue is whether we want to include a simple crossbar
> >> switch model in a MultiIface implementation or provide a ‘standalone’
> >> fine-grain model for the switch (e.g. the pd-gem5 approach).
> >>
> >> Thanks,
> >> - Gabor
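
As a rough illustration of the terminology above (hypothetical helper names, not
code from multi-gem5): a distributed checkpoint counts as "synchronous" when
every gem5 process performed the global synchronization step at the same
simulated tick. The second helper shows how many ticks the simulation would run
past the barrier if the checkpoint were deferred to the end of the quantum.

```python
def classify_checkpoint(sync_ticks):
    """sync_ticks: simulated tick at which each gem5 process stopped."""
    return "synchronous" if len(set(sync_ticks)) == 1 else "asynchronous"


def overshoot_if_deferred(barrier_tick, quantum):
    """Ticks simulated past the barrier when waiting for the quantum boundary."""
    next_boundary = -(-barrier_tick // quantum) * quantum  # round up to a multiple
    return next_boundary - barrier_tick
```

For example, a barrier reached at tick 1250 with a quantum of 1000 would be
overshot by 750 ticks, which is the window in which the ROI code could already
start executing.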
> >>
> >>
> >>
> >> On 7/3/15, 7:33 PM, "Steve Reinhardt" <***@gmail.com> wrote:
> >>
> >> >Thanks Mohammad & Gabor for the responses.
> >> >
> >> >I think there's still some misunderstanding on what I mean by the
> >> >integration of multi-threaded and multi-host simulation based on
> >> >Gabor's response above and Andreas's response in the other thread.
> >> >
> >> >The primary example scenario I'm proposing is as Mohammad described:
> >> >within each host node, we're simulating an entire rack + top-of-rack
> >> >switch in a single gem5 process, with separate event queues/threads
> >> >being used to parallelize across nodes within the rack. The switch may
> >> >or may not be on its own thread as well. The synchronization among the
> >> >threads only needs to be at the granularity of the intra-rack network
> >> >latency.
> >> >
> >> >Now we want to expand this by using pd-gem5 or multi-gem5 to
> >> >parallelize multiple of these rack-level simulations across hosts, so
> >> >we can simulate a whole row of a datacenter. Only the uplinks from the
> >> >TOR switches would need to go over sockets between processes, and the
> >> >switch being modeled by pd-gem5 or multi-gem5 would be the end-of-row
> >> >switch. The synchronization delay among the multiple gem5 processes
> >> >would be based on the inter-rack latency.
> >> >
> >> >So the basic question is: Is this feasible with pd-gem5 / multi-gem5,
> >> >and if not, how much work would it take to make it so?
> >> >
> >> >However, my larger point is that I still don't see value in ever
> >> >building a shared-memory transport for MultiIface. For this model,
> >> >there is clearly no need for it. Things get more complicated if we want
> >> >to do something like have N nodes connected to a single switch and
> >> >split that over two hosts (with N/2 nodes simulated on each), but even
> >> >in that case, I think it's a better idea to make the switch model deal
> >> >with having half of its links internal and half external (since we
> >> >already want the same model to work in both the all-internal and
> >> >all-external cases). Not that I'm worried that someone is about to go
> >> >off and build this shared-memory transport, but I think it's important
> >> >to reach an understanding here, since it's fundamental to defining the
> >> >strategic relationship between these capabilities going forward.
> >> >
> >> >Stepping back a little further, it would be nice to have a model that
> >> >is as generic as the multi-threading model, where it's really just a
> >> >matter of taking a simulation, partitioning the components among the
> >> >threads, and setting the synchronization quantum, and it works. Of
> >> >course, even with the multi-threaded model, if you don't choose your
> >> >partitioning and your quantum wisely, you're not going to get much
> >> >speedup or a deterministic simulation, but the fundamental
> >> >implementation is oblivious to that. I'm not saying we really need to
> >> >go all the way to this extreme---it's pretty reasonable to assume that
> >> >no one in the near future will want to partition across hosts anywhere
> >> >other than on a simulated network link---but I think we should keep
> >> >this ideal in mind as a guiding principle as we choose how to go
> >> >forward from here.
> >> >
> >> >This ties in to my point #4, which is that if we're really building a
> >> >mechanism to partition a simulation across multiple hosts, then you
> >> >should be able to run the same simulation in a single gem5 process and
> >> >get the same results. I think this is the strength of pd-gem5;
> >> >correspondingly the main weakness of multi-gem5 is that it
> >> >architecturally feels more like tying together a set of mostly
> >> >independent gem5 simulations than like partitioning a single gem5
> >> >simulation. (Of course, they both end up at roughly the same point in
> >> >the middle.)
> >> >
> >> >On the flip side, multi-gem5 has some clear advantages in terms of the
> >> >better separation of the communication layer (and I can imagine it
> >> >being very useful to port to MPI and perhaps some RDMA API for
> >> >InfiniBand clusters). Also I think the integrated sockets for
> >> >communication and synchronization are the superior design; while the
> >> >separate sockets used by pd-gem5 may only very rarely cause problems, I
> >> >agree with Andreas that that's not good enough, and I don't see any
> >> >real advantage either---if you have to flush the data sockets (or wait
> >> >for them to drain) before synchronizing, then you might as well just
> >> >have the synchronization messages queue up behind the data messages.
> >> >
> >> >Regarding unsynchronized checkpoints: Thanks for the example, but I'm
> >> >still a little confused. If all the processes are about to execute an
> >> >MPI_Barrier(), doesn't that mean they'll all be synchronized shortly
> >> >anyway? So what's the harm in waiting until they're synchronized and
> >> >then checkpointing?
> >> >
> >> >Regarding the simulation of non-Ethernet networks: I agree that the
> >> >biggest obstacle to this is the lack of generality of the current gem5
> >> >network components. I tried to take a step toward supporting other link
> >> >types two years ago (see http://reviews.gem5.org/r/1922) but someone
> >> >shot me down ;). We shouldn't try and fix that here, but we should also
> >> >consciously try not to make it any worse...
> >> >
> >> >Thanks for reading all the way to the end!
> >> >
> >> >Steve
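
The two-level synchronization described above can be sketched as follows,
assuming each synchronization domain uses a quantum equal to the smallest
latency of the links that cross its partitions. The latencies and function
names are invented for illustration and do not come from pd-gem5 or multi-gem5.

```python
def domain_quantum(crossing_link_latencies):
    """Largest safe quantum: the minimum latency of any link crossing a partition."""
    return min(crossing_link_latencies)


def sync_counts(sim_ticks, intra_rack_latencies, inter_rack_latencies):
    """Number of thread-level and process-level syncs over `sim_ticks` ticks."""
    thread_q = domain_quantum(intra_rack_latencies)    # threads inside one gem5
    process_q = domain_quantum(inter_rack_latencies)   # gem5 processes on hosts
    return sim_ticks // thread_q, sim_ticks // process_q
```

With intra-rack links of 1000 and 2000 ticks and inter-rack uplinks of 50000 and
80000 ticks, one million simulated ticks need 1000 thread-level syncs but only
20 socket-level syncs between hosts.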
> >> >
> >> >
> >> >On Fri, Jul 3, 2015 at 7:11 AM Gabor Dozsa <***@arm.com> wrote:
> >> >
> >> >>Hi all,
> >> >>
> >> >>Thank you Steve for the thorough review.
> >> >>
> >> >>First, let me elaborate a bit on Andreas’s 3rd point about
> >> >>non-synchronous checkpoints. Let’s assume that we aim to simulate MPI
> >> >>applications (HPC workloads). The ROI in an MPI application typically
> >> >>starts with a global MPI_Barrier() call. We want to take the checkpoint
> >> >>when *every* gem5 process has reached that MPI_Barrier() in the
> >> >>simulated code, but that may not happen at the same tick in each gem5
> >> >>(due to load imbalance among the simulated nodes). That’s why
> >> >>multi-gem5 implements the non-synchronous checkpoint support.
> >> >>
> >> >>My answers to your questions are as follows.
> >> >>
> >> >>1. The only change necessary to use multi-gem5 with a non-Ethernet
> >> >>(simulated) network is to replace the Ethernet packet type with
> >> >>another packet type in MultiIface.
> >> >>In fact, the first implementation of MultiIface was a template that
> >> >>took EthPacketData as a parameter because I planned to support
> >> >>different network types. When I realized that currently only Ethernet
> >> >>is supported by gem5, I dropped the template param to keep the
> >> >>implementation simpler. I have also realized in the meantime that the
> >> >>right approach would probably be to create a pure virtual ‘base’ class
> >> >>for network packets from which Ethernet (and other types of) packets
> >> >>could be derived. Then MultiIface could simply use that base class to
> >> >>provide support for different network types. The interface provided by
> >> >>the base packet class could be very simple. Besides the total size()
> >> >>of the packet, multi-gem5 only needs a method to ‘extract’ the
> >> >>source/destination address. Those addresses are used in MultiIface as
> >> >>opaque byte arrays so they are quite network type agnostic already.
> >> >>
> >> >>2. That’s right, we have designed the MultiIface/TCPIface split with
> >> >>different underlying messaging systems in mind.
> >> >>
> >> >>3. Multi-gem5 can work together with multi-threaded/multi-event-queue
> >> >>gem5 configs. The current TCPIface/tcp_server components would still
> >> >>use sockets to send around the packets. So it is possible to put
> >> >>together a multi-gem5 simulation where each gem5 process has multiple
> >> >>event queues (and an independent simulation thread per event queue)
> >> >>but all the simulated Ethernet links would use sockets to forward
> >> >>every Ethernet packet to the tcp_server.
> >> >>
> >> >>If someone wanted to run only a single gem5 process to simulate an
> >> >>entire cluster (using one thread/event-queue per cluster node) then
> >> >>the current multi-gem5 implementation using sockets/tcp_server is not
> >> >>optimal. In that case, a better solution would be to provide a shared
> >> >>memory based implementation of the MultiIface virtual communication
> >> >>methods sendRaw()/recvRaw()/syncRaw() (i.e. a shared memory equivalent
> >> >>of TCPIface). In that implementation, the entire discrete tcp_server
> >> >>component could be replaced with a shared data structure.
> >> >>
> >> >>4. You are right, the current implementation does not make it possible
> >> >>to construct an equivalent single-process simulation model for a
> >> >>multi-gem5 run. However, a possible solution is a shared memory based
> >> >>implementation of the MultiIface virtual communication methods just as
> >> >>I described in the previous paragraph. The same implementation could
> >> >>then work with both multi-threaded/multi-event-queues and
> >> >>single-thread/single-event-queue gem5 configs.
> >> >>
> >> >>Thanks,
> >> >>- Gabor
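
A minimal sketch of the packet base class idea from point 1, written in Python
for brevity although gem5 itself would use C++. All class and method names here
(NetPacket, src_addr, dst_addr, route) are assumptions made for illustration;
only the requirement of a total size plus opaque source/destination addresses
comes from the description above.

```python
from abc import ABC, abstractmethod


class NetPacket(ABC):
    """Network-agnostic view of a packet: a size and two opaque addresses."""

    def __init__(self, data: bytes):
        self.data = data

    def size(self) -> int:
        return len(self.data)

    @abstractmethod
    def src_addr(self) -> bytes: ...

    @abstractmethod
    def dst_addr(self) -> bytes: ...


class EthPacket(NetPacket):
    """Ethernet frame: 6-byte destination MAC followed by 6-byte source MAC."""

    def dst_addr(self) -> bytes:
        return self.data[0:6]

    def src_addr(self) -> bytes:
        return self.data[6:12]


def route(packet: NetPacket, table: dict) -> int:
    """Look up the destination by its raw address bytes; -1 means broadcast."""
    return table.get(packet.dst_addr(), -1)
```

Because route() only ever sees the address as bytes, a different network type
would only need its own NetPacket subclass, which is the property the opaque
byte arrays are meant to provide.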
> >> >>
> >> >>On 7/2/15, 7:20 PM, "Steve Reinhardt" <***@gmail.com> wrote:
> >> >>
> >> >>>Hi everyone,
> >> >>>
> >> >>>Sorry for taking so long to engage. This is a great development and I
> >> >>>think both these patches are terrific contributions. Thanks to
> >> >>>Mohammad, Gabor, and everyone else involved.
> >> >>>
> >> >>>I agree with Andreas that we should start with some top-level goals &
> >> >>>assumptions, agree on those, and then we can sort out the detailed
> >> >>>issues based on a consistent view.
> >> >>>
> >> >>>I definitely agree with Andreas's first two points. The third one
> >> >>>seems a little surprising; I'd like to hear more about the motivation
> >> >>>before expressing an opinion. I can see where non-synchronous
> >> >>>checkpointing could be useful, but it's also clear from the
> >> >>>associated patch that it's not trivial to implement either. How much
> >> >>>would be lost by requiring a synchronization before a checkpoint?
> >> >>>
> >> >>>From my personal perspective, I would like to see whatever we do here
> >> >>>be a first step toward a more general distributed simulation
> >> >>>platform. Both of these patches seem pretty Ethernet-centric in
> >> >>>different ways. This is not terrible; part of the problem is that
> >> >>>gem5's current internal networking support is already overly
> >> >>>Ethernet-centric IMO. But it would be nice to avoid baking that in
> >> >>>even further. Rather than assume I have understood all the code
> >> >>>completely, I'll phrase things in the form of questions, and people
> >> >>>can comment on how those questions would be answered in the context
> >> >>>of the two different approaches.
> >> >>>
> >> >>>1. How much effort would be required to simulate a non-Ethernet
> >> >>>network? My impression is that pd-gem5 has a leg up here, since a
> >> >>>gem5 switch model for a non-Ethernet network (which you'd have to
> >> >>>write anyway if you were simulating a different network) could be
> >> >>>used in place of the current Ethernet switch, where for multi-gem5 I
> >> >>>think that the util/multi//tcp_server.cc code would have to be
> >> >>>modified (i.e., there'd be additional work above and beyond what
> >> >>>you'd need to get the network modeled in base gem5).
> >> >>>
> >> >>>2. How much effort is required to run on a non-Ethernet network (or
> >> >>>equivalently using a non-sockets API)? The MultiIface/TCPIface split
> >> >>>in the multi-gem5 code looks like it addresses this nicely, but
> >> >>>pd-gem5 seems pretty tied to an Ethernet host fabric.
> >> >>>
> >> >>>3. Do both of these patches work with the existing multithreaded
> >> >>>multiple-event-queue simulation? I think multi-gem5 does (though it
> >> >>>would be nice to have a confirmation), but it's not clear about
> >> >>>pd-gem5. I don't see a benefit to having multiple gem5 processes on a
> >> >>>single host vs. a single multithreaded gem5 process using the
> >> >>>existing support. I think this could be particularly valuable with a
> >> >>>hierarchical network; e.g., maybe I would want to model a rack in
> >> >>>multithreaded mode on a single multicore server, then use pd-gem5 or
> >> >>>multi-gem5 to build up a simulation of multiple racks. Would this
> >> >>>work out of the box with either of these patches, and if not, what
> >> >>>would need to be done?
> >> >>>
> >> >>>4. Is it possible to construct a single-process simulation model
> >> >>>that's identical to the distributed simulation? It would be very
> >> >>>valuable for verification to be able to take a single simulation run
> >> >>>and do it both within a single process and also across multiple
> >> >>>processes and verify that identical results are achieved. This seems
> >> >>>like a big drawback to the multi-gem5 tcp_server approach, IMO.
> >> >>>
> >> >>>I'm definitely not saying that all these issues need to be resolved
> >> >>before
> >> >>>anything gets committed, but if we can agree that these are valid
> >> >>goals,
> >> >>>then we can evaluate detailed issues based on whether they move us
> >> >>toward
> >> >>>or away from those goals.
> >> >>>
> >> >>>Thanks,
> >> >>>
> >> >>>Steve
> >> >>>
> >> >>>
> >> >>>On Thu, Jul 2, 2015 at 8:34 AM Andreas Hansson
> >> >><***@arm.com>
> >> >>>wrote:
> >> >>>
> >> >>>>Hi all,
> >> >>>>
> >> >>>>I think we need to up-level this a bit. From our perspective (and I
> >> >>>>suspect in general):
> >> >>>>
> >> >>>>1. Robustness is important. Having a design that _may_ break,
> >>however
> >> >>>>unlikely, is simply not an option.
> >> >>>>
> >> >>>>2. Performance and scaling are important. We can compare actual
> >>numbers
> >> >>>>here, and I am fairly sure the two solutions are on par. Let’s
> >> >>quantify
> >> >>>>that though.
> >> >>>>
> >> >>>>3. Checkpointing must not rely on synchronicity. It is vital for
> >> >>several
> >> >>>>workloads that we can checkpoint the various gem5 instances at
> >> >>different
> >> >>>>Ticks (due to the way the workloads are constructed).
> >> >>>>
> >> >>>>Andreas
> >> >>>>
> >> >>>>On 01/07/2015 21:41, "gem5-dev on behalf of Mohammad Alian"
> >> >>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >> >>>>
> >> >>>>>Thanks Gabor for the reply.
> >> >>>>>
> >> >>>>>I feel this conversation is useful as we can find out pros/cons of
> >> >>each
> >> >>>>>design.
> >> >>>>>Please find my response in-lined below.
> >> >>>>>
> >> >>>>>Thank you,
> >> >>>>>Mohammad
> >> >>>>>
> >> >>>>>On Wed, Jul 1, 2015 at 6:44 AM, Gabor Dozsa <***@arm.com>
> >> >>>>wrote:
> >> >>>>>
> >> >>>>>>Hi All,
> >> >>>>>>
> >> >>>>>>Sorry for the missing indentation in my previous e-mail! (This was
> >> >>my
> >> >>>>>>first e-mail to the dev-list so I could not simply use “reply").
> >> >>>>Below
> >> >>>>>>is
> >> >>>>>>the same message, hopefully in more readable form.
> >> >>>>>>
> >> >>>>>>====================================
> >> >>>>>>
> >> >>>>>>Hi All,
> >> >>>>>>
> >> >>>>>>Thank you Mohammad for your elaboration on the issues!
> >> >>>>>>
> >> >>>>>>I have written most of the multi-gem5 patch so let me add some
> >>more
> >> >>>>>>clarifications and answer to your concerns. My comments are
> >>inline
> >> >>>>>>below.
> >> >>>>>>
> >> >>>>>>Thanks,
> >> >>>>>>- Gabor
> >> >>>>>>
> >> >>>>>>On 6/27/15, 10:20 AM, "Mohammad Alian" <***@wisc.edu> wrote:
> >> >>>>>>
> >> >>>>>>>Hi All,
> >> >>>>>>>
> >> >>>>>>>Curtis-Thank you for listing some of the differences. I was
> >> >>waiting
> >> >>>>for
> >> >>>>>>>the
> >> >>>>>>>completed multi-gem5 patch before I send my review. Please see my
> >> >>>>>>inline
> >> >>>>>>>response below. I've addressed the concerns that you've raised.
> >> >>>>Also,
> >> >>>>>>I've
> >> >>>>>>>added a bit more to the comparison.
> >> >>>>>>>
> >> >>>>>>>* Synchronization.
> >> >>>>>>>
> >> >>>>>>>pd-gem5 implements this in Python (not a problem in itself;
> >> >>>>>>aesthetically
> >> >>>>>>>
> >> >>>>>>>this is nice, but...). The issue is that pd-gem5's data packets
> >> >>and
> >> >>>>>>>
> >> >>>>>>>barrier messages travel over different sockets. Since pd-gem5
> >> >>could
> >> >>>>>>see
> >> >>>>>>>
> >> >>>>>>>data packets passing synchronization barriers, it could create an
> >> >>>>>>>
> >> >>>>>>>inconsistent checkpoint.
> >> >>>>>>>
> >> >>>>>>>multi-gem5's synchronization is implemented in C++ using sync
> >> >>>>events,
> >> >>>>>>but
> >> >>>>>>>
> >> >>>>>>>more importantly, the messages queue up in the same stream and so
> >> >>>>>>cannot
> >> >>>>>>>
> >> >>>>>>>have the issue just described. (Event ordering is often crucial
> >> >>in
> >> >>>>>>>
> >> >>>>>>>snapshot protocols.) Therefore we feel that multi-gem5 is a more
> >> >>>>robust
> >> >>>>>>>
> >> >>>>>>>solution in this respect.
> >> >>>>>>>
> >> >>>>>>>Each packet in pd-gem5 has a time-stamp. So even if data packets
> >> >>>>pass
> >> >>>>>>>synchronization barriers (in other words, data packets arrive
> >> >>early
> >> >>>>at
> >> >>>>>>the
> >> >>>>>>>destination node), the destination node processes packets based on
> >>their
> >> >>>>>>>timestamp. Actually allowing data packets to pass sync barriers
> >> >>is a
> >> >>>>>>nice
> >> >>>>>>>feature that can reduce the likelihood of late packet reception.
> >> >>>>>>Ordering
> >> >>>>>>>of data messages that flow over pd-gem5 nodes is also preserved
> >>in
> >> >>>>>>pd-gem5
> >> >>>>>>>implementation.
> >> >>>>>>
> >> >>>>>>This seems to be a misunderstanding. Maybe the wording was not
> >> >>>>precise
> >> >>>>>>before. The problem is not a data packet "passing" a sync
> >> >>barrier
> >> >>>>>>but the other way around, a sync barrier that can pass a data
> >> >>packet
> >> >>>>>>(e.g. while the data packet is waiting in the host operating
> >>system
> >> >>>>>>socket layer). If that happens, the packet will arrive later than
> >> >>it
> >> >>>>>>was
> >> >>>>>>supposed to and it may miss the computed receive tick.
> >> >>>>>>
> >> >>>>>>For instance, let’s assume that the quantum coincides with the
> >> >>>>simulated
> >> >>>>>>Ether link delay. (This is the optimal choice of quantum to
> >> >>minimize
> >> >>>>the
> >> >>>>>>number of sync barriers.) If a data packet is sent right at the
> >> >>>>>>beginning
> >> >>>>>>of a quantum then this packet must arrive at the destination gem5
> >> >>>>>>process
> >> >>>>>>within the same quantum in order not to miss its receive tick at
> >> >>the
> >> >>>>>>very
> >> >>>>>>beginning of the next quantum. If the sync barrier can pass the
> >> >>data
> >> >>>>>>packet
> >> >>>>>>then the data packet may arrive only during the next quantum (or
> >> >>in
> >> >>>>>>extreme conditions even later than that) so when it arrives the
> >> >>>>receiver
> >> >>>>>>gem5 may already have passed the receive tick.
> >> >>>>>>
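The late-arrival condition described above can be sketched as follows. This is a minimal illustration under stated assumptions, not code from either patch: the names `Packet`, `receive_tick`, and `is_late` are hypothetical, and the quantum is assumed to equal the simulated link delay, as in the example.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    # Simulated tick at which the sender put the packet on the link.
    send_tick: int


def receive_tick(pkt: Packet, link_delay: int) -> int:
    """Simulated tick at which the packet must be delivered to the receiver."""
    return pkt.send_tick + link_delay


def is_late(pkt: Packet, link_delay: int, receiver_tick: int) -> bool:
    """True if the receiver has already simulated past the packet's receive tick."""
    return receiver_tick > receive_tick(pkt, link_delay)


QUANTUM = 1000        # hypothetical quantum length in ticks
LINK_DELAY = QUANTUM  # quantum chosen equal to the simulated link delay

# A packet sent at the very start of a quantum is due at the start of the next.
pkt = Packet(send_tick=0)

# Delivered to the receiver process while it is still inside the first quantum.
arrived_in_time = not is_late(pkt, LINK_DELAY, receiver_tick=999)

# The sync barrier overtook the packet; the receiver is already at tick 1500.
missed_receive_tick = is_late(pkt, LINK_DELAY, receiver_tick=1500)
```

With a quantum equal to the link delay there is no slack: any packet that reaches the receiver process only after the barrier completes is at risk of missing its receive tick.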
> >> >>>>>>This argument makes more sense than the previous one. Note that
> >> >>gem5
> >> >>>>is
> >> >>>>>>a
> >> >>>>>cycle accurate simulator and it runs orders of magnitude slower
> >>than
> >> >>>>real
> >> >>>>>hardware. So it's almost impossible that the flight time of a packet
> >> >>>>through
> >> >>>>>a real network turns out to be more than the simulation time of one quantum.
> >>We
> >> >>>>ran
> >> >>>>>a
> >> >>>>>set of experiments just for this purpose: with quantum size equal
> >>to
> >> >>>>>etherlink delay, we never got any late arrival violation (what you
> >> >>>>>described) for the full NAS benchmark suite (please refer to the
> >>paper).
> >> >>>>>
> >> >>>>>multi-gem5 is optimized for a case that almost never happens, and
> >> >>>>>is sacrificing speedup for no gain.
> >> >>>>>
> >> >>>>>
> >> >>>>>>Time-stamping does help with this issue. Also, if a data packet is
> >> >>>>>>waiting
> >> >>>>>>in the host operating system socket layer when the simulation
> >> >>thread
> >> >>>>>>exits
> >> >>>>>>to python to complete the next sync barrier then the packet will
> >> >>>>not go
> >> >>>>>>into the checkpoint that may follow that sync barrier.
> >> >>>>>>
> >> >>>>>>That's a good point. Current pd-gem5 checkpointing mechanism might
> >> >>>>miss
> >> >>>>>packets that have been sent during previous quantum and are waiting
> >> >>in
> >> >>>>OS
> >> >>>>>socket buffer. I should add some code inside ethertap serialization
> >> >>>>>function to drain ethertap socket before writing checkpoint. I will
> >> >>>>update
> >> >>>>>pd-gem5 patch accordingly.
> >> >>>>>
> >> >>>>>>
> >> >>>>>>>What you mentioned as an advantage for multi-gem5 is actually a
> >> >>key
> >> >>>>>>>disadvantage: buffering sync messages behind data packets can add
> >> >>>>to
> >> >>>>>>>the
> >> >>>>>>>synchronization overhead and slow down simulation significantly.
> >> >>>>>>
> >> >>>>>>The purpose of sync messages is to make sure that the data packets
> >> >>>>>>arrive
> >> >>>>>>in time (in terms of simulated time) at the destination so they
> >>can
> >> >>>>be
> >> >>>>>>scheduled for being received at the proper computed tick. Sync
> >> >>>>messages
> >> >>>>>>also make sure that no data packets are in flight when a sync
> >> >>barrier
> >> >>>>>>completes before we take a checkpoint. They definitely add
> >> >>overhead
> >> >>>>for
> >> >>>>>>the simulation but they are necessary for the correctness of the
> >> >>>>>>simulation.
> >> >>>>>>
> >> >>>>>>The receive thread in multi-gem5 reads out packets from the socket
> >> >>in
> >> >>>>>>parallel with the simulation thread so packets normally will not
> >>be
> >> >>>>>>"queueing up" before a sync barrier message. There is definitely
> >> >>>>room
> >> >>>>>>for improvements in the current implementation for reducing the
> >> >>>>>>synchronization overhead but that is likely true for pd-gem5, too.
> >> >>>>>>The important thing here is that the solution must provide
> >> >>>>correctness
> >> >>>>>>(robustness) first.
> >> >>>>>>
> >> >>>>>>pd-gem5 provides correctness. Please read my previous comment. The
> >> >>>>whole
> >> >>>>>purpose of multi/pd-gem5 is to parallelize simulation with minimal
> >> >>>>>overhead
> >> >>>>>and gain speedup. If you fail to do so, nobody will use your tool.
> >> >>>>>
> >> >>>>>
> >> >>>>>>>Also,
> >> >>>>>>>multi-gem5 sends huge messages (multiHeaderPkt) through the
> >> >>>>network to
> >> >>>>>>>perform each synchronization point, which increases
> >> >>synchronization
> >> >>>>>>>overhead further. In pd-gem5, we choose to send just one
> >>character
> >> >>>>as
> >> >>>>>>sync
> >> >>>>>>>message through a separate socket to reduce synchronization
> >> >>>>overhead.
> >> >>>>>>
> >> >>>>>>The TCP/IP message size is unlikely to be the bottleneck here.
> >>Multi-gem5
> >> >>>>will
> >> >>>>>>send ~50 bytes more in a sync barrier message than pd-gem5 but
> >>that
> >> >>>>>>bigger
> >> >>>>>>sync message still fits into a single ethernet frame on the wire.
> >> >>The
> >> >>>>>>end-to-end latency overhead that is caused by 50 bytes extra
> >> >>payload
> >> >>>>for
> >> >>>>>>a small single frame TCP/IP message is likely to fall into the
> >> >>>>"noise"
> >> >>>>>>category if one tries to measure it in a real cluster.
> >> >>>>>>
> >> >>>>>>You should prove your hypothesis experimentally. Each gem5 process
> >> >>>>>sends/receives sync messages at the end of every quantum. Say you are
> >> >>>>>simulating an "N" node computer cluster with "M" different
> >> >>configurations.
> >> >>>>>Then
> >> >>>>>you will have N*M gem5 processes that send/receive these 50 Bytes
> >>(I
> >> >>>>>think
> >> >>>>>it's more) of extra data at the same time over the network ...
> >> >>>>>
> >> >>>>>Furthermore, multi-gem5 sends a header before each data message.
> >> >>>>In comparison,
> >> >>>>>pd-gem5 just adds 12 Bytes (each time-stamp is the 12
> >>least
> >> >>>>>significant digits of the Tick) to each data packet. I don't know
> >> >>>>exactly
> >> >>>>>how large these "MultiHeaderPkt" are, but it has two Tick
> >>fields
> >> >>>>that
> >> >>>>>are each 64 Bytes! Also, header packets are separate TCP packets, so
> >> >>you
> >> >>>>>pay
> >> >>>>>for sending two separate packets for each data packet. And worse,
> >>you
> >> >>>>>serialize all of these with sync messages.
> >> >>>>>
> >> >>>>>
> >> >>>>>>>
> >> >>>>>>>* Packet handling.
> >> >>>>>>>
> >> >>>>>>>pd-gem5 uses EtherTap for data packets but changed the polling
> >> >>>>>>mechanism
> >> >>>>>>>
> >> >>>>>>>to go through the main event queue. Since this rate is actually
> >> >>>>linked
> >> >>>>>>>
> >> >>>>>>>with simulator progress, it cannot guarantee that the packets are
> >> >>>>>>>serviced
> >> >>>>>>>
> >> >>>>>>>at regular intervals of real time. This can lead to packets
> >> >>>>queueing
> >> >>>>>>up
> >> >>>>>>>
> >> >>>>>>>which would contribute to the synchronization issues mentioned
> >> >>>>above.
> >> >>>>>>>
> >> >>>>>>>multi-gem5 uses plain sockets with separate receive threads and
> >>so
> >> >>>>does
> >> >>>>>>>not
> >> >>>>>>>
> >> >>>>>>>have this issue.
> >> >>>>>>>
> >> >>>>>>>I think again you are pointing to your first concern that I've
> >> >>>>>>explained
> >> >>>>>>>above. Packets that have queued up in EtherTap socket, will be
> >> >>>>>>processed
> >> >>>>>>>and delivered to simulation environment at the beginning of next
> >> >>>>>>>simulation
> >> >>>>>>>quantum.
> >> >>>>>>>
> >> >>>>>>>Please notice that multi-gem5 introduces new simObjects to
> >> >>>>interface
> >> >>>>>>>the simulation environment with the real world, which is redundant. This
> >> >>>>>>>functionality
> >> >>>>>>>is already provided by EtherTap.
> >> >>>>>>
> >> >>>>>>Except that the EtherTap solution does not provide a correct
> >> >>(robust)
> >> >>>>>>solution for the synchronization problem.
> >> >>>>>>
> >> >>>>>>Please read my first/second comments.
> >> >>>>>
> >> >>>>>
> >> >>>>>>>
> >> >>>>>>>* Checkpoint accuracy.
> >> >>>>>>>
> >> >>>>>>>A user would like to have a checkpoint at precisely the time the
> >> >>>>>>>
> >> >>>>>>>'m5 checkpoint' operation is executed so as to not miss any of
> >>the
> >> >>>>>>>
> >> >>>>>>>area of interest in his application.
> >> >>>>>>>
> >> >>>>>>>pd-gem5 requires that simulation finish the current quantum
> >> >>>>>>>
> >> >>>>>>>before checkpointing, so it cannot provide this.
> >> >>>>>>>
> >> >>>>>>>(Shortening the quantum can help, but usually the snapshot is
> >> >>being
> >> >>>>>>taken
> >> >>>>>>>
> >> >>>>>>>while 'fast-forwarding', i.e. simulating as fast as possible,
> >> >>which
> >> >>>>>>would
> >> >>>>>>>
> >> >>>>>>>motivate a longer quantum.)
> >> >>>>>>>
> >> >>>>>>>multi-gem5 can enter the drain cycle immediately upon receiving a
> >> >>>>>>>
> >> >>>>>>>checkpoint request. We find this accuracy highly desirable.
> >> >>>>>>>
> >> >>>>>>>It's true that if you have a large quantum size then there would
> >> >>be
> >> >>>>>>some
> >> >>>>>>>discrepancy between the m5_ckpt instruction tick and the actual
> >> >>dump
> >> >>>>>>tick.
> >> >>>>>>>Based on the multi-gem5 code, my understanding is that you send an async
> >> >>>>>>>checkpoint message as soon as one of the gem5 processes encounters an
> >> >>>>>>m5_ckpt
> >> >>>>>>>instruction. But I'm not sure how you fix the aforementioned
> >> >>issue,
> >> >>>>>>>because
> >> >>>>>>>you have to sync all gem5 processes before you start dumping
> >> >>>>>>checkpoint,
> >> >>>>>>>which necessitates a global synchronization beforehand.
> >> >>>>>>
> >> >>>>>>In multi-gem5, the gem5 process that encounters the m5_ckpt
> >> >>>>instruction
> >> >>>>>>sends out an async checkpoint notification for the peer gem5
> >> >>>>processes
> >> >>>>>>and
> >> >>>>>>then it starts the draining immediately (at the same tick). So
> >>the
> >> >>>>>>checkpoint will be taken at the exact tick from the initiator
> >> >>process
> >> >>>>>>point of view. The global synchronisation with the peer processes
> >> >>>>takes
> >> >>>>>>place while the initiator process is still waiting at the same
> >>tick
> >> >>>>(i.e.
> >> >>>>>>the simulation thread is suspended). However, the receiver thread
> >> >>>>>>continues reading out the socket - while waiting for the global
> >> >>sync
> >> >>>>to
> >> >>>>>>complete - to make sure that in-flight data packets from peer gem5
> >> >>>>>>processes
> >> >>>>>>are stored properly and saved into the checkpoint.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>So you mean multi-gem5 ends up with having gem5 processes with
> >> >>>>different
> >> >>>>>ticks after checkpoint? In pd-gem5 we make sure that all gem5
> >> >>processes
> >> >>>>>start dumping checkpoint at the same tick. Are you sure that this
> >>is
> >> >>>>>correct to have each gem5 process dump checkpoint at different
> >> >>ticks???
> >> >>>>>
> >> >>>>>I don't think this is a correct checkpointing design. However, if you
> >> >>>>feel it
> >> >>>>>is correct, I can change a couple of lines in "Simulation.py" and
> >> >>>>barrier
> >> >>>>>scripts to implement the same functionality in pd-gem5. One thing
> >> >>that
> >> >>>>you
> >> >>>>>are obsessed about is to make sure that there are no in-flight
> >>packets
> >> >>>>>while
> >> >>>>>we start dumping checkpoint, and you have all these complex
> >> >>mechanisms
> >> >>>>in
> >> >>>>>place to ensure that! I think you can 99.99999% make sure that
> >>there
> >> >>>>is no
> >> >>>>>in-flight packet by waiting for 1 second after all gem5 processes
> >> >>>>finished
> >> >>>>>their quantum simulation and then dump checkpoint. Do you really
> >> >>think
> >> >>>>>that
> >> >>>>>delivering a tcp packet would take more than 1 second in today's
> >> >>>>systems!?
> >> >>>>>Always go for simple solutions ...
> >> >>>>>
> >> >>>>>
> >> >>>>>
> >> >>>>>>>
> >> >>>>>>>By the way, we have a fix for this issue by introducing a new m5
> >> >>>>pseudo
> >> >>>>>>>instruction.
> >> >>>>>>
> >> >>>>>>I fail to see how a new pseudo instruction can solve the problem
> >>of
> >> >>>>>>completing the full quantum in pd-gem5 before a checkpoint can be
> >> >>>>taken.
> >> >>>>>>Could you please elaborate on that?
> >> >>>>>>
> >> >>>>>>As we take checkpoint while fast-forwarding and it is likely that
> >> >>we
> >> >>>>>>relax
> >> >>>>>synchronization for speedup purpose, a new pseudo instruction that
> >> >>can
> >> >>>>set
> >> >>>>>quantum size (m5_qset) can be helpful. So, one can insert m5_qset
> >>in
> >> >>>>his
> >> >>>>>benchmark source code before entering ROI that contains m5_ckpt to
> >> >>>>>decrease
> >> >>>>>quantum size beforehand and reduce the discrepancy between m5_ckpt
> >> >>tick
> >> >>>>>and
> >> >>>>>actual checkpoint tick. This is not included in pd-gem5 patch right
> >> >>>>now.
> >> >>>>>
> >> >>>>>
> >> >>>>>>>
> >> >>>>>>>* Implementation of network topology.
> >> >>>>>>>
> >> >>>>>>>pd-gem5 uses a separate gem5 process to act as a switch whereas
> >> >>>>>>multi-gem5
> >> >>>>>>>
> >> >>>>>>>uses a standalone packet relay process.
> >> >>>>>>>
> >> >>>>>>>We haven't measured the overhead of pd-gem5's simulated switch
> >> >>yet,
> >> >>>>but
> >> >>>>>>>
> >> >>>>>>>we're confident that our approach is at least as fast and more
> >> >>>>>>scalable.
> >> >>>>>>>
> >> >>>>>>>There is this flexibility in pd-gem5 to simulate a switch box
> >> >>>>alongside
> >> >>>>>>>one
> >> >>>>>>>of the other gem5 processes. However, it might make that gem5
> >> >>>>process
> >> >>>>>>the
> >> >>>>>>>simulation bottleneck. One of the advantages of pd-gem5 over
> >> >>>>>>multi-gem5 is
> >> >>>>>>>that we use gem5 to simulate a switch box, which allows us to
> >> >>model
> >> >>>>any
> >> >>>>>>>network topology by instantiating several Switch simObjects and
> >> >>>>>>>interconnect them with EtherLink in an arbitrary fashion. A
> >> >>>>standalone
> >> >>>>>>tcp
> >> >>>>>>>server can only provide switch functionality (forwarding packets
> >> >>to
> >> >>>>>>>destinations) and model a star network topology. Furthermore, it
> >> >>>>cannot
> >> >>>>>>>model various network timings such as queueing delay, congestion,
> >> >>>>and
> >> >>>>>>>routing latency. Also it has some accuracy issues that I will
> >> >>point
> >> >>>>out
> >> >>>>>>>next.
> >> >>>>>>
> >> >>>>>>I agree with the complex topology argument. We already mentioned
> >> >>that
> >> >>>>>>before as an advantage for pd-gem5 from the point of view of
> >>future
> >> >>>>>>extensions. However, I do not agree that multi-gem5 cannot model
> >> >>>>>>queueing
> >> >>>>>>delays and congestions. For a simple crossbar switch, it can model
> >> >>>>>>queueing
> >> >>>>>>delays and congestions, but the receive queues are distributed
> >> >>among
> >> >>>>the
> >> >>>>>>gem5 processes.
> >> >>>>>>
> >> >>>>>>It's true that you can model queuing delay of a simple crossbar by
> >> >>>>>distributing queues across gem5 processes (end points). But to be
> >> >>able
> >> >>>>to
> >> >>>>>do so you have to ensure the ordering of packets that you enqueue
> >>in
> >> >>>>the
> >> >>>>>distributed queues. It is almost impossible without a synchronized
> >> >>>>switch
> >> >>>>>box. You should have a reorder queue that reorders packets
> >> >>dynamically
> >> >>>>and
> >> >>>>>updates the timing parameters for each packet as well. I don't know how
> >> >>much
> >> >>>>>progress you have made on an ordering scheme in multi-gem5, but
> >>you
> >> >>>>may
> >> >>>>>have already realized how complex and error-prone it can be. This
> >> >>>>argument
> >> >>>>>is also related to my next argument for "Broken network timing".
> >> >>>>>
> >> >>>>>
> >> >>>>>>>
> >> >>>>>>>* Broken network timing:
> >> >>>>>>>
> >> >>>>>>>Forwarding packets between gem5 processes using a standalone tcp
> >> >>>>server
> >> >>>>>>>can
> >> >>>>>>>cause reordering between packets that have different source but
> >> >>same
> >> >>>>>>>destination. It causes inaccurate network timing and, worst of
> >>all,
> >> >>>>>>>non-deterministic simulation. pd-gem5 resolves this by reordering
> >> >>>>>>packets
> >> >>>>>>>at
> >> >>>>>>>the Switch process and then sending them to their destination (it's
> >> >>>>possible
> >> >>>>>>as
> >> >>>>>>>the switch is synchronized with the rest of the nodes).
> >> >>>>>>
> >> >>>>>>In multi-gem5, there is always a HeaderPkt that contains some meta
> >> >>>>>>information for each data packet. The meta information includes the
> >> >>>>send
> >> >>>>>>tick and the sender rank (i.e. a unique ID of the sender gem5
> >> >>>>process).
> >> >>>>>>We use that information to define a well-defined ordering of
> >> >>packets
> >> >>>>>>even
> >> >>>>>>if packets are arriving at the same receiver from different
> >> >>senders.
> >> >>>>>>This
> >> >>>>>>packet ordering scheme is still being tested so the corresponding
> >> >>>>patch
> >> >>>>>>is
> >> >>>>>>not on the RB yet.
> >> >>>>>>
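One way to read the ordering rule described above is as a sort on the pair (send tick, sender rank). The sketch below is only an illustration of that idea, assuming those two fields are the whole sort key; the `Header` type and its field names are hypothetical and are not taken from the multi-gem5 patch.

```python
from typing import List, NamedTuple


class Header(NamedTuple):
    send_tick: int    # simulated tick at which the packet was sent
    sender_rank: int  # unique ID of the sending gem5 process
    payload: bytes


def deterministic_order(arrivals: List[Header]) -> List[Header]:
    """Order packets by (send_tick, sender_rank), ignoring host arrival order."""
    return sorted(arrivals, key=lambda h: (h.send_tick, h.sender_rank))


# The same three packets, seen in two different host arrival orders.
a = Header(send_tick=100, sender_rank=1, payload=b"a")
b = Header(send_tick=100, sender_rank=0, payload=b"b")
c = Header(send_tick=50, sender_rank=2, payload=b"c")

run1 = deterministic_order([a, b, c])
run2 = deterministic_order([c, a, b])
same_order_both_runs = run1 == run2
```

Because the key depends only on simulated quantities, two runs that see different host-level arrival orders still deliver the packets in the same sequence, provided all packets with a given send tick have arrived before they are sorted.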
> >> >>>>>>Please read my previous comment. The most important part of
> >> >>>>>>multi/pd-gem5
> >> >>>>>extension is ensuring accurate and deterministic simulation.
> >> >>>>>
> >> >>>>>
> >> >>>>>>>
> >> >>>>>>>* Amount of changes
> >> >>>>>>>
> >> >>>>>>>pd-gem5 introduces different modes in etherlink just to provide
> >> >>>>accurate
> >> >>>>>>>timing for each component in the network subsystem (NIC, link,
> >> >>>>switch)
> >> >>>>>>as
> >> >>>>>>>well as capability of modeling different network topologies
> >>(mesh,
> >> >>>>>>ring,
> >> >>>>>>>fat tree, etc). To enable a simple functionality, like what
> >> >>>>multi-gem5
> >> >>>>>>>provides, the amount of changes in gem5 can be limited to
> >> >>>>time-stamping
> >> >>>>>>>packets and providing synchronization through python scripts.
> >> >>>>However,
> >> >>>>>>>multi-gem5 re-implements functionalities that are already in gem5.
> >> >>>>>>
> >> >>>>>>This argument holds only if both implementations are correct
> >> >>>>(robust).
> >> >>>>>>It
> >> >>>>>>still seems to me that pd-gem5 does not provide correctness for
> >>the
> >> >>>>>>synchronization/checkpointing parts.
> >> >>>>>>
> >> >>>>>>Again, please read my first comment for correctness of pd-gem5.
> >> >>>>>
> >> >>>>>
> >> >>>>>>>
> >> >>>>>>>* Integrating with gem5 mainstream:
> >> >>>>>>>
> >> >>>>>>>pd-gem5 launch script is written in python which is suited for
> >> >>>>>>integration
> >> >>>>>>>with gem5 python scripts. However multi-gem5 uses bash script.
> >> >>Also,
> >> >>>>>>all
> >> >>>>>>>source files in pd-gem5 are already parts of gem5 mainstream.
> >> >>>>However
> >> >>>>>>>multi-gem5 has tcp_server.cc/hh that is a standalone process and
> >> >>>>cannot
> >> >>>>>>be
> >> >>>>>>>part of gem5.
> >> >>>>>>
> >> >>>>>>The multi-gem5 launch script is simple enough to rely only on the
> >> >>>>>>shell. It
> >> >>>>>>can obviously be easily re-written in python if that added any
> >> >>value.
> >> >>>>>>The
> >> >>>>>>tcp_server component is only a utility (like the "m5" utility that
> >> >>is
> >> >>>>>>also
> >> >>>>>>part of gem5).
> >> >>>>>>
> >> >>>>>>The thing is that it's more likely that users want to add some
> >> >>>>>functionality to the run-script of multi/pd-gem5. E.g. pd-gem5
> >> >>>>run-script
> >> >>>>>supports launching simulations using a simulation pool management
> >> >>>>>software (
> >> >>>>>http://research.cs.wisc.edu/htcondor/). Using python enables users
> >>to
> >> >>>>>easily add this kind of support.
> >> >>>>>
> >> >>>>>
> >> >>>>>>
> >> >>>>>>Cheers,
> >> >>>>>>- Gabor
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>>On Fri, Jun 26, 2015 at 8:40 PM, Curtis Dunham
> >> >>>><***@arm.com>
> >> >>>>>>>wrote:
> >> >>>>>>>
> >> >>>>>>>>Hello everyone,
> >> >>>>>>>>We have taken a look at how pd-gem5 compares with multi-gem5.
> >> >>>>While
> >> >>>>>>>>intending
> >> >>>>>>>>to deliver the same functionality, there are some crucial
> >> >>>>differences:
> >> >>>>>>>>
> >> >>>>>>>>* Synchronization.
> >> >>>>>>>>
> >> >>>>>>>> pd-gem5 implements this in Python (not a problem in itself;
> >> >>>>>>>>aesthetically
> >> >>>>>>>> this is nice, but...). The issue is that pd-gem5's data
> >> >>>>packets
> >> >>>>>>and
> >> >>>>>>>> barrier messages travel over different sockets. Since
> >> >>pd-gem5
> >> >>>>>>could
> >> >>>>>>>>see
> >> >>>>>>>> data packets passing synchronization barriers, it could
> >> >>create
> >> >>>>an
> >> >>>>>>>> inconsistent checkpoint.
> >> >>>>>>>>
> >> >>>>>>>> multi-gem5's synchronization is implemented in C++ using
> >>sync
> >> >>>>>>events,
> >> >>>>>>>>but
> >> >>>>>>>> more importantly, the messages queue up in the same stream
> >> >>and
> >> >>>>so
> >> >>>>>>>>cannot
> >> >>>>>>>> have the issue just described. (Event ordering is often
> >> >>>>crucial
> >> >>>>>>in
> >> >>>>>>>> snapshot protocols.) Therefore we feel that multi-gem5 is a
> >> >>>>more
> >> >>>>>>>>robust
> >> >>>>>>>> solution in this respect.
> >> >>>>>>>>
> >> >>>>>>>>* Packet handling.
> >> >>>>>>>>
> >> >>>>>>>> pd-gem5 uses EtherTap for data packets but changed the
> >> >>polling
> >> >>>>>>>>mechanism
> >> >>>>>>>> to go through the main event queue. Since this rate is
> >> >>>>actually
> >> >>>>>>>>linked
> >> >>>>>>>> with simulator progress, it cannot guarantee that the
> >>packets
> >> >>>>are
> >> >>>>>>>>serviced
> >> >>>>>>>> at regular intervals of real time. This can lead to packets
> >> >>>>>>>>queueing up
> >> >>>>>>>> which would contribute to the synchronization issues
> >> >>mentioned
> >> >>>>>>above.
> >> >>>>>>>>
> >> >>>>>>>> multi-gem5 uses plain sockets with separate receive threads
> >> >>>>and so
> >> >>>>>>>>does
> >> >>>>>>>>not
> >> >>>>>>>> have this issue.
> >> >>>>>>>>
> >> >>>>>>>>* Checkpoint accuracy.
> >> >>>>>>>>
> >> >>>>>>>> A user would like to have a checkpoint at precisely the time
> >> >>the
> >> >>>>>>>> 'm5 checkpoint' operation is executed so as to not miss any
> >>of
> >> >>>>the
> >> >>>>>>>> area of interest in his application.
> >> >>>>>>>>
> >> >>>>>>>> pd-gem5 requires that simulation finish the current quantum
> >> >>>>>>>> before checkpointing, so it cannot provide this.
> >> >>>>>>>>
> >> >>>>>>>> (Shortening the quantum can help, but usually the snapshot is
> >> >>>>being
> >> >>>>>>>>taken
> >> >>>>>>>> while 'fast-forwarding', i.e. simulating as fast as possible,
> >> >>>>which
> >> >>>>>>>>would
> >> >>>>>>>> motivate a longer quantum.)
> >> >>>>>>>>
> >> >>>>>>>> multi-gem5 can enter the drain cycle immediately upon
> >> >>receiving
> >> >>>>a
> >> >>>>>>>> checkpoint request. We find this accuracy highly desirable.
> >> >>>>>>>>
> >> >>>>>>>>* Implementation of network topology.
> >> >>>>>>>>
> >> >>>>>>>> pd-gem5 uses a separate gem5 process to act as a switch
> >> >>whereas
> >> >>>>>>>>multi-gem5
> >> >>>>>>>> uses a standalone packet relay process.
> >> >>>>>>>>
> >> >>>>>>>> We haven't measured the overhead of pd-gem5's simulated
> >>switch
> >> >>>>yet,
> >> >>>>>>>>but
> >> >>>>>>>> we're confident that our approach is at least as fast and
> >>more
> >> >>>>>>>>scalable.
> >> >>>>>>>>
> >> >>>>>>>>
> >> >>>>>>>>Thanks,
> >> >>>>>>>>Curtis
> >> >>>>>>>>________________________________________
> >> >>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] On Behalf Of
> Mohammad
> >> >>>>>>Alian [
> >> >>>>>>>>***@wisc.edu]
> >> >>>>>>>>Sent: Friday, June 26, 2015 7:37 PM
> >> >>>>>>>>To: gem5 Developer List
> >> >>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a
> >> >>parallel/distributed
> >> >>>>>>>>system
> >> >>>>>>>>on multiple physical hosts
> >> >>>>>>>>
> >> >>>>>>>>Hi Anthony,
> >> >>>>>>>>
> >> >>>>>>>>I think that would be a good option, then I can add pd-gem5
> >> >>>>>>>>functionality
> >> >>>>>>>>on top of that. Right now I've simplified your implementation.
> >> >>>>Also, I
> >> >>>>>>>>think I had found some bugs in your patch that I cannot remember
> >> >>>>now.
> >> >>>>>>If
> >> >>>>>>>>you decided to ship EtherSwitch patch, let me know to give you a
> >> >>>>>>review
> >> >>>>>>>>on
> >> >>>>>>>>that.
> >> >>>>>>>>
> >> >>>>>>>>Thanks,
> >> >>>>>>>>Mohammad
> >> >>>>>>>>
> >> >>>>>>>>On Thu, Jun 25, 2015 at 8:36 PM, Gutierrez, Anthony <
> >> >>>>>>>>***@amd.com> wrote:
> >> >>>>>>>>
> >> >>>>>>>>>Would it make sense for me to ship the EtherSwitch patch first,
> >> >>>>since
> >> >>>>>>>>it
> >> >>>>>>>>>has utility on its own, and then we can decide which of the
> >> >>>>>>>>"multi-gem5"
> >> >>>>>>>>>approaches is best, or if it's some combination of both?
> >> >>>>>>>>>
> >> >>>>>>>>>The only reason I never shipped it was because Steve raised an
> >> >>>>issue
> >> >>>>>>>>that
> >> >>>>>>>>>I didn't have a good alternative for, and didn't have the time
> >> >>to
> >> >>>>>>look
> >> >>>>>>>>into
> >> >>>>>>>>>one at that time.
> >> >>>>>>>>>________________________________________
> >> >>>>>>>>>From: gem5-dev [gem5-dev-***@gem5.org] on behalf of
> >>Mohammad
> >> >>>>>>>>Alian [
> >> >>>>>>>>>***@wisc.edu]
> >> >>>>>>>>>Sent: Wednesday, June 24, 2015 12:43 PM
> >> >>>>>>>>>To: gem5 Developer List
> >> >>>>>>>>>Subject: Re: [gem5-dev] pd-gem5: simulating a
> >> >>parallel/distributed
> >> >>>>>>>>system
> >> >>>>>>>>>on multiple physical hosts
> >> >>>>>>>>>
> >> >>>>>>>>>Hi Andreas,
> >> >>>>>>>>>
> >> >>>>>>>>>Thanks for the comment.
> >> >>>>>>>>>I think the checkpointing support in both works is the same.
> >> >>Here
> >> >>>>is
> >> >>>>>>>>how
> >> >>>>>>>>>checkpointing support is implemented in pd-gem5:
> >> >>>>>>>>>
> >> >>>>>>>>>Whenever one of the gem5 processes encounters an m5-checkpoint
> >>pseudo
> >> >>>>>>>>>instruction, it will send a "recv-ckpt" signal to the
> >> >>>>>>>>>"barrier" process. Then the "barrier" process sends a
> >> >>"take-ckpt"
> >> >>>>>>>>signal
> >> >>>>>>>>to
> >> >>>>>>>>>all the simulated nodes
> >> >>>>>>>>>(including the node that encountered m5-checkpoint) at the end
> >> >>of
> >> >>>>the
> >> >>>>>>>>>current simulation quantum. On the reception of the
> >> >>>>>>>>>"take-ckpt" signal, gem5 processes start dumping check-points.
> >> >>>>This
> >> >>>>>>>>makes
> >> >>>>>>>>>each simulated node dump a checkpoint
> >> >>>>>>>>>at the same simulated time point while ensuring there are no
> >> >>>>in-flight
> >> >>>>>>>>>packets.
> >> >>>>>>>>>
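The checkpoint handshake described above can be sketched as a small in-process model. This is a simplified illustration, not the pd-gem5 barrier script: the `Barrier` class, its method names, and the "continue" message are hypothetical, and real sockets are replaced by plain method calls. Only the "recv-ckpt" and "take-ckpt" signal names come from the description.

```python
class Barrier:
    """Toy model of a barrier process that coordinates checkpoints."""

    def __init__(self, num_nodes: int, quantum: int) -> None:
        self.num_nodes = num_nodes
        self.quantum = quantum
        self.ckpt_requested = False

    def recv_ckpt(self, node_id: int) -> None:
        # A node hit an m5-checkpoint pseudo instruction during this quantum.
        self.ckpt_requested = True

    def end_of_quantum(self, quantum_index: int) -> dict:
        # Every node has reached the quantum boundary, so all nodes are at
        # the same simulated tick when the reply is sent.
        boundary_tick = (quantum_index + 1) * self.quantum
        signal = "take-ckpt" if self.ckpt_requested else "continue"
        self.ckpt_requested = False
        return {node: (signal, boundary_tick) for node in range(self.num_nodes)}


barrier = Barrier(num_nodes=3, quantum=1000)
barrier.recv_ckpt(node_id=1)                        # node 1 requests a checkpoint
replies = barrier.end_of_quantum(quantum_index=0)   # all nodes told to checkpoint
next_replies = barrier.end_of_quantum(quantum_index=1)
```

In this model the checkpoint tick is always a quantum boundary, which is why the gap between the requesting instruction and the actual dump can be as large as one quantum.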
> >> >>>>>>>>>I believe this is the same as multi-gem5 patch approach for
> >> >>>>>>checkpoint
> >> >>>>>>>>>support (based on the commit message of
> >> >>>>>>>>http://reviews.gem5.org/r/2865/
> >> >>>>>>>>).
> >> >>>>>>>>>Also, we have tested our mechanism with several benchmarks and
> >> >>it
> >> >>>>>>>>works.
> >> >>>>>>>>As
> >> >>>>>>>>>Steve suggested, I'll look into Curtis's patch and try to
> >>review
> >> >>>>it
> >> >>>>>>as
> >> >>>>>>>>>well.
> >> >>>>>>>>>But as Nilay also mentioned earlier, there is some code
> >> >>missing
> >> >>>>in
> >> >>>>>>>>>Curtis's patch. I prefer to first run multi-gem5 before
> >>starting
> >> >>>>to
> >> >>>>>>>>review
> >> >>>>>>>>>it.
> >> >>>>>>>>>
> >> >>>>>>>>>Thank you,
> >> >>>>>>>>>Mohammad
> >> >>>>>>>>>
> >> >>>>>>>>>On Wed, Jun 24, 2015 at 7:25 AM, Andreas Hansson <
> >> >>>>>>>>***@arm.com>
> >> >>>>>>>>>wrote:
> >> >>>>>>>>>
> >> >>>>>>>>>>Hi Steve,
> >> >>>>>>>>>>
> >> >>>>>>>>>>Apologies for the confusion. We are on the same page. My point is
> >> >>>>>>>>>>that we cannot simply take a little bit of patch A and a little bit of
> >> >>>>>>>>>>patch B. This change involves a lot of code, and we need to approach
> >> >>>>>>>>>>this in a structured fashion. My proposal is to do it bottom up, and
> >> >>>>>>>>>>start by getting the basic support in place. Since
> >> >>>>>>>>>>http://reviews.gem5.org/r/2826/ has already been on the review board
> >> >>>>>>>>>>for a few months, I am merely suggesting that it would be a good start
> >> >>>>>>>>>>to relate the newly posted patches to what is already there.
> >> >>>>>>>>>>
> >> >>>>>>>>>>Andreas
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>>On 24/06/2015 13:11, "gem5-dev on behalf of Steve Reinhardt"
> >> >>>>>>>>>><gem5-dev-***@gem5.org on behalf of ***@gmail.com> wrote:
> >> >>>>>>>>>>
> >> >>>>>>>>>>>Hi Andreas,
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>I'm a little confused by your email---you say you're fundamentally
> >> >>>>>>>>>>>opposed to looking at both patches and picking the best features, then
> >> >>>>>>>>>>>you point out that the patches Curtis posted have the feature of
> >> >>>>>>>>>>>better checkpointing support so we should pick that :).
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>Obviously we can't just pick patch A from Mohammad's set and patch B
> >> >>>>>>>>>>>from Curtis's set and expect them to work together, but I think that
> >> >>>>>>>>>>>having both sets of patches available and comparing and contrasting
> >> >>>>>>>>>>>the two implementations should enable us to get to a single
> >> >>>>>>>>>>>implementation that's the best of both. Someone will have to make the
> >> >>>>>>>>>>>effort of integrating the better ideas from one set into the other set
> >> >>>>>>>>>>>to create a new unified set of patches; (or maybe we commit one set
> >> >>>>>>>>>>>and then integrate the best of the other set as patches on top of
> >> >>>>>>>>>>>that), but the first step is to identify what "the best of both" is.
> >> >>>>>>>>>>>Having Mohammad look at Curtis's patches, and Curtis (or someone else
> >> >>>>>>>>>>>from ARM) closely examine Mohammad's patches would be a great start. I
> >> >>>>>>>>>>>intend to review them both, though unfortunately my time has been
> >> >>>>>>>>>>>scarce lately---I'm hoping to squeeze that in later this week.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>Once we've had a few people look at both, we can discuss the pros and
> >> >>>>>>>>>>>cons of each, then discuss the strategy for getting the best features
> >> >>>>>>>>>>>in. So far I've heard that Mohammad's patches have a better network
> >> >>>>>>>>>>>model but the ARM patches have better checkpointing support; that
> >> >>>>>>>>>>>seems like a good start.
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>Steve
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>On Wed, Jun 24, 2015 at 12:26 AM Andreas Hansson <***@arm.com> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>>>Hi all,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>Great work. However, I fundamentally do not believe in the approach
> >> >>>>>>>>>>>>of 'letting reviewers pick the best features'. There is no way we
> >> >>>>>>>>>>>>would ever get something working out of it. We need to get _one_
> >> >>>>>>>>>>>>working solution here, and figure out how to best get there. I would
> >> >>>>>>>>>>>>propose to do it bottom up, starting with the basic multi-simulator
> >> >>>>>>>>>>>>instance support, checkpointing support, and then move on to the
> >> >>>>>>>>>>>>network between the simulator instances.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>Thus, I propose we go with the low-level plumbing and checkpoint
> >> >>>>>>>>>>>>support from what Curtis has posted. I believe proper checkpointing
> >> >>>>>>>>>>>>support to be the most challenging, and from what I can tell this is
> >> >>>>>>>>>>>>far more limited in what you just posted, Mohammad. Could you perhaps
> >> >>>>>>>>>>>>review Curtis's patches based on your insights, so that we can try
> >> >>>>>>>>>>>>and get these patches in shape and committed asap?
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>Once we have the baseline functionality in place, then we can start
> >> >>>>>>>>>>>>looking at the more elaborate network models.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>Does this sound reasonable?
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>Thanks,
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>Andreas
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>On 24/06/2015 05:05, "gem5-dev on behalf of Mohammad Alian"
> >> >>>>>>>>>>>><gem5-dev-***@gem5.org on behalf of ***@wisc.edu> wrote:
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>>Hello All,
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>I have submitted a chain of patches which enables gem5 to simulate a
> >> >>>>>>>>>>>>>cluster on multiple physical hosts:
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>http://reviews.gem5.org/r/2909/
> >> >>>>>>>>>>>>>http://reviews.gem5.org/r/2910/
> >> >>>>>>>>>>>>>http://reviews.gem5.org/r/2912/
> >> >>>>>>>>>>>>>http://reviews.gem5.org/r/2913/
> >> >>>>>>>>>>>>>http://reviews.gem5.org/r/2914/
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>and a patch that contains run scripts for a simple experiment:
> >> >>>>>>>>>>>>>http://reviews.gem5.org/r/2915/
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>We have run several benchmarks using this infrastructure, including
> >> >>>>>>>>>>>>>NAS parallel benchmarks (MPI) and DCBench-hadoop
> >> >>>>>>>>>>>>>(http://prof.ict.ac.cn/DCBench/),
> >> >>>>>>>>>>>>>and would be happy to share scripts/diskimages.
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>We call this *pd-gem5*. *pd-gem5* functionality is more or less the
> >> >>>>>>>>>>>>>same as Curtis's patch for *multi-gem5*. However, I feel the
> >> >>>>>>>>>>>>>*pd-gem5* network model is more thorough; it also enables modeling
> >> >>>>>>>>>>>>>different network topologies. Having both sets of changes together
> >> >>>>>>>>>>>>>lets reviewers pick the best features from both works.
> >> >>>>>>>>>>>>>
> >> >>>>>>>>>>>>>Thank you,
> >> >>>>>>>>>>>>>Mohammad Alian
> >> >>>>>>>>>>>>>_______________________________________________
> >> >>>>>>>>>>>>>gem5-dev mailing list
> >> >>>>>>>>>>>>>gem5-***@gem5.org
> >> >>>>>>>>>>>>>http://m5sim.org/mailman/listinfo/gem5-dev
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>-- IMPORTANT NOTICE: The contents of this email and any attachments
> >> >>>>>>>>>>>>are confidential and may also be privileged. If you are not the
> >> >>>>>>>>>>>>intended recipient, please notify the sender immediately and do not
> >> >>>>>>>>>>>>disclose the contents to any other person, use it for any purpose, or
> >> >>>>>>>>>>>>store or copy the information in any medium. Thank you.
> >> >>>>>>>>>>>>
> >> >>>>>>>>>>>>ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
> >> >>>>>>>>>>>>Registered in England & Wales, Company No: 2557590
> >> >>>>>>>>>>>>ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1
> >> >>>>>>>>>>>>9NJ, Registered in England & Wales, Company No: 2548782
Gabor Dozsa
2015-07-07 16:05:11 UTC
Mohammad, I’m not sure what you mean by “taking a checkpoint with quantum
size smaller than link latency”.

In multi-gem5, the quantum size and the checkpoint are completely
independent. The quantum is the number of ticks simulated between two
consecutive periodic syncs - that’s why every periodic sync happens at the
same tick in each gem5 process. A checkpoint can be taken at any point
within a quantum. After the checkpoint is taken, each gem5 process
completes what remained of the current quantum and then enters the next
periodic sync.

When fast-forwarding, you can increase the link latency to allow a larger
quantum and reduce the periodic sync overhead. Does that make sense?
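
To make the arithmetic above concrete, here is a minimal Python sketch. It
assumes periodic syncs fall on multiples of the quantum counted from tick 0;
the function names are hypothetical and are not taken from the multi-gem5
patches.

```python
def next_sync_tick(tick: int, quantum: int) -> int:
    """First periodic-sync tick strictly after `tick`.

    Assumption (not from the patches): syncs happen at multiples of
    `quantum`, counted from tick 0.
    """
    return (tick // quantum + 1) * quantum


def remaining_in_quantum(tick: int, quantum: int) -> int:
    """Ticks a process still simulates after a checkpoint at `tick`
    before it enters the next periodic sync."""
    return next_sync_tick(tick, quantum) - tick


def periodic_syncs(total_ticks: int, quantum: int) -> int:
    """Number of periodic syncs needed to simulate `total_ticks` ticks."""
    return total_ticks // quantum


if __name__ == "__main__":
    quantum = 1_000
    # Two processes checkpointed (or restored) at different ticks inside
    # the same quantum still meet at the same next sync tick.
    for tick in (1_250, 1_900):
        print(tick, next_sync_tick(tick, quantum),
              remaining_in_quantum(tick, quantum))
    # A larger quantum (allowed by a larger link latency) means fewer syncs.
    print(periodic_syncs(1_000_000, 1_000), periodic_syncs(1_000_000, 10_000))
```

Under these assumptions, a tenfold larger quantum cuts the number of periodic
syncs over the same simulated interval by a factor of ten, which is why a
larger link latency helps during fast-forwarding.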

- Gabor

On 7/7/15, 4:11 PM, "Mohammad Alian" <***@wisc.edu> wrote:

>Then you are assuming taking a checkpoint with a quantum size smaller than
>the link latency, which contradicts your initial motivation for unsync
>checkpoints! (I copied this sentence from earlier messages in the thread
>as a reminder):
>"Shortening the quantum can help, but usually the snapshot is being taken
>while 'fast-forwarding', i.e. simulating as fast as possible, which would
>motivate a longer quantum."
>
>What if somebody wants to relax synchronization and take a checkpoint?
>
>On Tue, Jul 7, 2015 at 7:38 AM, Gabor Dozsa <***@arm.com> wrote:
>
>>
>> Hi Mohammad and all,
>>
>> gem5 processes may restore at a different tick from a checkpoint but the
>> next periodic sync will happen at the same tick in all gem5 processes. A
>> receive tick of a packet cannot fall into the current quantum so every
>> packet can get scheduled for receive properly even if a checkpoint/restore
>> happens during a quantum.
>>
>> Regarding your multi-threaded dual config, my understanding is that
>> EtherLink is not prepared to work with multi threading as it lacks thread
>> safety. The multiple event queues/threads config only works if the systems
>> are independent.
>>
>> One possible way to fix that is to provide a “multi-thread” based
>> implementation for MultiIface ;-)
>>
>> - Gabor
>>
>> On 7/7/15, 6:29 AM, "Mohammad Alian" <***@wisc.edu> wrote:
>>
>> >Gabor- My concern about unsync checkpoints is that when you restore from an
>> >unsync checkpoint, you'll have gem5 processes that are each running at a
>> >different tick. Then how do you handle accurate delivery of packets between
>> >these gem5 processes? It will also make it harder to integrate
>> >multi/pd-gem5 with the current multi-threaded gem5. The problem with sync
>> >checkpoints is that you cannot take a checkpoint exactly at the ROI, but I
>> >think unsync checkpoints introduce some other problems. Considering the
>> >necessary warmup period before starting stat collection, I think we don't
>> >need to exactly pinpoint the ROI. Please correct me if I'm wrong.
>> >
>> >I'm trying to run a multi-threaded experiment with pd-gem5, but I got an
>> >error when I tried to partition a dual mode simulation across two threads. I
>> >posted that on the gem5 users mailing list. Please help me with that if you
>> >can.
>> >
>> >Thank you,
>> >Mohammad
>> >
>> >On Mon, Jul 6, 2015 at 11:45 AM, Gabor Dozsa <***@arm.com> wrote:
>> >
>> >> Thank you Steve for the detailed elaboration on the issues.
>> >>
>> >>
>> >> Regarding the “unsynchronized checkpoints”, the terminology might be a bit
>> >> confusing. In fact, we always need to do a global synchronization among
>> >> the gem5 processes before taking a distributed checkpoint (in order to
>> >> avoid in-flight packets). The global synchronization here means that each
>> >> gem5 has to suspend the simulation and wait until every in-flight packet
>> >> arrives (and is stored) at the destination gem5 process. If that global
>> >> synchronization step happens at the same simulated tick in each gem5 then
>> >> we call the checkpoint “synchronous”, otherwise it is an “asynchronous”
>> >> checkpoint.
>> >>
>> >> In the MPI application example I mentioned before, the checkpoint should be
>> >> triggered as soon as the “slowest” MPI process reaches the MPI_barrier().
>> >> The problem is that the “slowest” MPI process usually does not reach the
>> >> MPI_barrier() right at the end of the current quantum. If we let the
>> >> simulation continue until the quantum completes (to ensure that the
>> >> checkpoint is taken at the same simulated tick in each gem5) then the MPI
>> >> processes will have completed the MPI_barrier() and started executing the
>> >> ROI code already.
>> >>
>> >> Regarding the integration of multi-threaded/multi-host simulation,
>> >> multi-gem5 does not currently support fine grain simulation of
>> >> hierarchical switches (or any other network topologies except a single
>> >> crossbar) or multiple synchronization domains.
>> >>
>> >> However, I'm a bit confused about your statement that you don’t see value
>> >> in ever building a shared-memory transport for MultiIface. MultiIface in
>> >> my view is just an abstract interface for “multi-(ether)-link” objects
>> >> which are link objects for connecting multiple (i.e. more than two)
>> >> systems. It aims to encapsulate the API necessary for any Link object
>> >> in any multi-system configuration - provided that we partition the
>> >> systems across network links during run time.
>> >>
>> >> An orthogonal issue is whether we want to include a simple crossbar switch
>> >> model in a MultiIface implementation or we want to provide a ‘standalone’
>> >> fine grain model for the switch (e.g. the pd-gem5 approach).
>> >>
>> >> Thanks,
>> >> - Gabor
>> >>
>> >>
>> >>
>> >> On 7/3/15, 7:33 PM, "Steve Reinhardt" <***@gmail.com> wrote:
>> >>
>> >> >Thanks Mohammad & Gabor for the responses.
>> >> >
>> >> >I think there's still some misunderstanding on what I mean by the
>> >> >integration of multi-threaded and multi-host simulation based on Gabor's
>> >> >response above and Andreas's response in the other thread.
>> >> >
>> >> >The primary example scenario I'm proposing is as Mohammad described:
>> >> >within each host node, we're simulating an entire rack + top-of-rack
>> >> >switch in a single gem5 process, with separate event queues/threads being
>> >> >used to parallelize across nodes within the rack. The switch may or may
>> >> >not be on its own thread as well. The synchronization among the threads
>> >> >only needs to be at the granularity of the intra-rack network latency.
>> >> >
>> >> >Now we want to expand this by using pd-gem5 or multi-gem5 to parallelize
>> >> >multiple of these rack-level simulations across hosts, so we can simulate
>> >> >a whole row of a datacenter. Only the uplinks from the TOR switches would
>> >> >need to go over sockets between processes, and the switch being modeled by
>> >> >pd-gem5 or multi-gem5 would be the end-of-row switch. The synchronization
>> >> >delay among the multiple gem5 processes would be based on the inter-rack
>> >> >latency.
>> >> >
>> >> >So the basic question is: Is this feasible with pd-gem5 / multi-gem5, and
>> >> >if not, how much work would it take to make it so?
>> >> >
>> >> >However, my larger point is that I still don't see value in ever
>> >>building
>> >> >a
>> >> >s