Discussion:
[gem5-dev] Review Request: Forward invalidations from Ruby to O3 CPU
Nilay Vaish
2011-10-18 06:50:47 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

Review request for Default.


Summary
-------

This patch implements the functionality for forwarding invalidations
and replacements from the L1 cache of the Ruby memory system to the O3
CPU. The implementation adds a list of ports to RubyPort. Whenever a replacement
or an invalidation is performed, the L1 cache forwards it to all the ports,
which, I believe, means the LSQ in the case of the O3 CPU. Those who understand
the O3 LSQ should take a close look at the implementation and figure out (at
least qualitatively) whether something is missing or erroneous.

This patch only modifies the MESI CMP directory protocol. I will modify the
other protocols once we sort out the major issues surrounding this patch.
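A minimal sketch of the forwarding scheme described above, with illustrative
names (these are stand-ins for the actual RubyPort/M5Port interfaces, not the
patch's real code): the L1 controller's eviction callback walks the list of
CPU-side ports and delivers the invalidated block address to each one.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a CPU-side port; in gem5 the delivery would be
// a timing snoop packet into the O3 LSQ rather than a recorded address.
struct CpuSidePort {
    std::vector<uint64_t> invalidations_seen;  // block addresses delivered
    void sendInvalidate(uint64_t blockAddr) {
        invalidations_seen.push_back(blockAddr);
    }
};

// Stand-in for RubyPort holding the "list of ports" the patch adds.
struct RubyPortSketch {
    std::vector<CpuSidePort*> cpu_ports;

    // Called by the L1 cache controller when a block is invalidated
    // or replaced; forwards the event to every attached port.
    void evictionCallback(uint64_t blockAddr) {
        for (CpuSidePort *p : cpu_ports)
            p->sendInvalidate(blockAddr);
    }
};
```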

My understanding is that this should ensure sequentially consistent (SC)
execution, as long as Ruby can support SC. But I think Ruby does not
currently enforce any memory model. A couple of issues that need discussion --

* Can this get into a deadlock? A CPU may not be able to proceed if
a particular cache block is repeatedly invalidated before the CPU
can retire the actual load/store instruction. How do we ensure that
at least one instruction is retired before an invalidation/replacement
is processed?

* How do we test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models, or
those present in manuals from AMD and Intel? I have tested that Ruby will
forward the invalidations, but not the part where the LSQ needs to act on
them.


Diffs
-----

build_opts/ALPHA_SE_MESI_CMP_directory 92ba80d63abc
configs/example/se.py 92ba80d63abc
configs/ruby/MESI_CMP_directory.py 92ba80d63abc
src/mem/protocol/MESI_CMP_directory-L1cache.sm 92ba80d63abc
src/mem/protocol/RubySlicc_Types.sm 92ba80d63abc
src/mem/ruby/system/RubyPort.hh 92ba80d63abc
src/mem/ruby/system/RubyPort.cc 92ba80d63abc
src/mem/ruby/system/Sequencer.hh 92ba80d63abc
src/mem/ruby/system/Sequencer.cc 92ba80d63abc

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Korey Sewell
2011-10-18 14:38:59 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1611
-----------------------------------------------------------



build_opts/ALPHA_SE_MESI_CMP_directory
<http://reviews.m5sim.org/r/894/#comment2092>

Do you really want to remove support for all the other CPU models?



configs/ruby/MESI_CMP_directory.py
<http://reviews.m5sim.org/r/894/#comment2090>

Hi Nilay,
thanks for starting the work on this. I've been waffling on giving it a go for a while now, so hopefully your push will get this rolling.

Do you think that sending invalidations should be optional? For testing this would be useful, but in general I would think you should always forward the invalidations to the CPU model and let the CPU model choose whether to use that info or not (for instance, an in-order model may even buffer multiple speculative loads behind a blocking memory op).



src/mem/ruby/system/RubyPort.cc
<http://reviews.m5sim.org/r/894/#comment2091>

Sanity checks:
(1) Is an express snoop right here? That will make the snoop instantaneous, right? Is that necessary if this is just going directly to an L1?

(2) Also, where does the meminhibit flag get deasserted?


- Korey
Nilay Vaish
2011-10-18 15:16:44 UTC
Permalink
Post by Korey Sewell
build_opts/ALPHA_SE_MESI_CMP_directory, line 3
<http://reviews.m5sim.org/r/894/diff/1/?file=15295#file15295line3>
Do you really want to remove support for all the other CPU models?
I did that just for testing purposes. I will ensure that this is
not part of the final commit.
Post by Korey Sewell
configs/ruby/MESI_CMP_directory.py, line 75
<http://reviews.m5sim.org/r/894/diff/1/?file=15297#file15297line75>
Hi Nilay,
thanks for starting the work on this. I've been waffling on giving it a go for awhile now, so hopefully your push will get this rolling.
Do you think that sending invalidations should be optional? For testing, this would be useful, but in general I would think you should always forward the invalidations to the CPU model and then the CPU model would choose to use that info or not (for instance, a inorder model may even buffer multiple speculative loads behind a blocking memory op)
In the case of timingSimpleCPU, the CPU stalls until the access is
made. I don't think that the CPU needs invalidations/replacements
from Ruby. Hence, I made that option available. We can always set it
so that both the detailed and in-order CPUs would receive invalidations
but timingSimpleCPU would not.
Post by Korey Sewell
src/mem/ruby/system/RubyPort.cc, line 685
<http://reviews.m5sim.org/r/894/diff/1/?file=15301#file15301line685>
(1) Is express snoop right here? That will make the snoop instantaneous right? Is that necessary if this just going directly to an L1?
(2) Also, where does the meminhibit flag get deasserted?
Quite possibly I am doing things incorrectly. I copied this code
from src/mem/cache/cache_impl.hh. Since Ruby currently has only
directory-based protocols, this is not exactly instantaneous. We can
schedule an event in the future, if that's what we want. But from
reading the code in cache_impl.hh, it seemed to me that invalidations
happen instantaneously.

As far as asserting MEM_INHIBIT is concerned, it might be completely
unnecessary. I don't know why it is asserted in the case of the
classic memory system.


- Nilay


Nilay Vaish
2011-10-20 18:36:18 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1613
-----------------------------------------------------------



src/mem/ruby/system/RubyPort.cc
<http://reviews.m5sim.org/r/894/#comment2096>

I figured out that the packet may not be processed at this point in time, but may instead be scheduled for processing later. Is it assured that the receiver will always delete the packet and request?


- Nilay
Steve Reinhardt
2011-10-20 19:08:10 UTC
Permalink
That's how it's supposed to work... the target is responsible for deleting
the packet, though it often simply reuses the packet for the response
message.

Steve
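The ownership convention Steve describes can be sketched as follows; the
class names and methods are simplified stand-ins for illustration, not
gem5's actual Packet API. The target either deletes the request packet or
converts it in place into the response, which the requestor then frees.

```cpp
#include <cassert>

struct Request { /* owned by the requestor, not shown here */ };

// Simplified stand-in for a request/response packet.
struct Packet {
    Request *req;       // non-owning pointer to the underlying request
    bool isResponse;
    explicit Packet(Request *r) : req(r), isResponse(false) {}
    void makeResponse() { isResponse = true; }  // reuse packet as response
};

// The target either deletes the packet when no response is needed ...
void targetConsume(Packet *pkt) { delete pkt; }

// ... or reuses the same packet object for the response message.
Packet *targetRespond(Packet *pkt) {
    pkt->makeResponse();
    return pkt;         // the requestor deletes this once it arrives
}
```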
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Nilay Vaish
2011-10-20 19:26:57 UTC
Permalink
What about the underlying request?

--
Nilay
Post by Steve Reinhardt
That's how it's supposed to work... the target is responsible for deleting
the packet, though it often simply reuses the packet for the response
message.
Steve
Steve Reinhardt
2011-10-20 19:36:48 UTC
Permalink
Sorry I missed that... the request is owned by the requestor, since it's
just a pointer to the request that's copied from the request packet to the
response packet.
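A sketch of the Request side of this convention, under the assumption
(illustrative names, not gem5's real classes) that the requestor keeps sole
ownership of the Request while packets carry only a non-owning pointer that
is the same on the request and the response:

```cpp
#include <cassert>
#include <memory>

struct Request { int addr; };

// Packets only carry a non-owning pointer; the same pointer is copied
// from the request packet to the response packet.
struct Packet {
    Request *req;
};

struct Requestor {
    std::unique_ptr<Request> pending;  // the requestor owns the Request

    Packet issue(int addr) {
        pending.reset(new Request{addr});
        return Packet{pending.get()};
    }

    // On the response (or, for fire-and-forget messages, once the packet
    // has been handed off), the requestor frees its own Request.
    void complete(const Packet &resp) {
        assert(resp.req == pending.get());  // same Request came back
        pending.reset();
    }
};
```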
Post by Nilay Vaish
What about the underlying request?
Nilay Vaish
2011-10-20 20:18:40 UTC
Permalink
Then, when should the requestor delete the request, assuming that there is
not going to be any response?

--
Nilay
Post by Steve Reinhardt
Sorry I missed that... the request is owned by the requestor, since it's
just a pointer to the request that's copied from the request packet to the
response packet.
Brad Beckmann
2011-10-28 05:35:20 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1620
-----------------------------------------------------------


Thanks for the heads up on this patch. I'm glad you found the time to dive into it.



I'm confused that the comment mentions a "list of ports", but I don't see a list of ports in the code, and I'm not sure how it would even be used.

The two questions you pose are good ones. Hopefully someone who understands the O3 LSQ can answer the first. For the second, I would suggest creating a new directed test that can manipulate the enqueue latency on the mandatory queue to create the necessary test situations.

Also, I have a couple high-level comments right now:



- Ruby doesn't implement any particular memory model. It just implements the cache coherence protocol, more specifically, invalidation-based protocols. The protocol, in combination with the core model, results in the memory model.


- I don't think it is sufficient to forward only those probes that hit valid copies to the O3 model. What about replacements of blocks that have serviced a speculative load? Instead, my thought would be to forward all probes to the O3 LSQ and think of CPU-controlled policies to filter out unnecessary probes.

- Brad
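The forward-everything-and-filter policy suggested above could look roughly
like this on the LSQ side; the structures here are assumptions for
illustration, not the actual O3 LSQ code. Every probe (invalidation or
replacement) reaches the LSQ, which squashes any speculative load to the
affected block.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// An in-flight load that has speculatively read its value.
struct SpeculativeLoad {
    uint64_t blockAddr;
    bool squashed = false;
};

struct LsqSketch {
    std::vector<SpeculativeLoad> loads;

    // CPU-side filtering policy: a probe only matters if it matches a
    // block read by a not-yet-retired speculative load. Returns the
    // number of loads squashed by this probe.
    int onProbe(uint64_t blockAddr) {
        int squashes = 0;
        for (auto &ld : loads) {
            if (!ld.squashed && ld.blockAddr == blockAddr) {
                ld.squashed = true;  // would trigger replay in the O3 model
                ++squashes;
            }
        }
        return squashes;
    }
};
```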
Nilay Vaish
2011-10-28 10:32:25 UTC
Permalink
Post by Brad Beckmann
Thanks for the heads up on this patch. I'm glad you found the time to dive into it.
I'm confused that the comment mentions a "list of ports", but I don't see a list of ports in the code and I'm not sure how would even be used?
The two questions you pose are good ones. Hopefully someone who understands the O3 LSQ can answer the first, and I would suggest creating a new directed test that can manipulate the enqueue latency on the mandatory queue to create the necessary test situations.
- Ruby doesn't implement any particular memory model. It just implements the cache coherence protocol, and more specifically invalidation based protocols. The protocol, in combination with the core model, results in the memory model.
- I don't think it is sufficient to just forward those probes that hit valid copies to the O3 model. What about replacements of blocks that have serviced a speculative load? Instead, my thought would be to forward all probes to the O3 LSQ and think of cpu-controlled policies to filter out unecessary probes.
Hi Brad, thanks for the response.

* A list of ports has been added in RubyPort.hh; ports are added
to the list whenever a new M5Port is created.

* As long as the core waits for an ack from the memory system for every store
before issuing the next one, I can understand that the memory model is
independent of how the memory system is implemented. But suppose the caches
are multi-ported. Will the core then use only one of the ports for stores and
wait for an ack? The current LSQ implementation uses as many ports as are
available. In this case, would the memory system not need to ensure the order
in which the stores are performed?

* I think the current implementation handles blocks whose coherence
permissions were speculatively fetched. If the cache loses permissions on
such a block, it will forward the probe to the CPU. If the cache again
receives a probe for this block, I don't think the CPU will have any
instruction using the value from that block.

* For testing, Prof. Wood suggested having something similar to TSOtool.
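A TSOtool-style check boils down to running litmus tests and rejecting
outcomes forbidden by the target model. As an illustration, here is the
classic store-buffering litmus test from the vendor manuals, written with
C++ seq_cst atomics; seq_cst gives SC, so the forbidden outcome
(both loads seeing 0) can never be observed.

```cpp
#include <atomic>
#include <thread>

// Store-buffering (SB) litmus test:
//   T1: x = 1; r1 = y;      T2: y = 1; r2 = x;
// Under SC, r1 == 0 && r2 == 0 is forbidden; under TSO it is allowed.
// Returns true if the forbidden outcome was ever seen across `iters` runs.
bool sbForbiddenOutcomeSeen(int iters) {
    for (int i = 0; i < iters; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] { x.store(1); r1 = y.load(); });
        std::thread t2([&] { y.store(1); r2 = x.load(); });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0)  // forbidden under SC
            return true;
    }
    return false;
}
```

A checker for the Ruby/O3 combination would run such tests inside the
simulator and flag any execution whose outcome the claimed model forbids.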


- Nilay


Brad Beckmann
2011-10-28 16:55:39 UTC
Permalink
Post by Nilay Vaish
Post by Brad Beckmann
Thanks for the heads up on this patch. I'm glad you found the time to dive into it.
I'm confused that the comment mentions a "list of ports", but I don't see a list of ports in the code, and I'm not sure how it would even be used?
The two questions you pose are good ones. Hopefully someone who understands the O3 LSQ can answer the first, and I would suggest creating a new directed test that can manipulate the enqueue latency on the mandatory queue to create the necessary test situations.
- Ruby doesn't implement any particular memory model. It just implements the cache coherence protocol, and more specifically invalidation based protocols. The protocol, in combination with the core model, results in the memory model.
- I don't think it is sufficient to just forward those probes that hit valid copies to the O3 model. What about replacements of blocks that have serviced a speculative load? Instead, my thought would be to forward all probes to the O3 LSQ and think of cpu-controlled policies to filter out unnecessary probes.
Hi Brad, thanks for the response.
* A list of ports has been added to RubyPort.hh, the ports are added
to the list whenever a new M5Port is created.
* As long as the core waits for an ack from the memory system for every store
before issuing the next one, I can understand that the memory model is independent
of how the memory system is implemented. But suppose the caches are multi-ported.
Then will the core only use one of the ports for stores and wait for an ack?
The current LSQ implementation uses as many ports as available. In this case,
would not the memory system need to ensure the order in which the stores are
performed?
* I think the current implementation handles blocks whose coherence permissions
were speculatively fetched. If the cache loses permissions on such a
block, then it will forward the probe to the CPU. If the cache again receives
a probe for this block, I don't think that the CPU will have any instruction
using the value from that block.
* For testing, Prof. Wood suggested having something similar to TSOtool.
Hmm...I'm now even more confused. I have not looked at the O3 LSQ, but it sounds like from your description that one particular instantiation of the LSQ will use N ports, not just a single port to the L1D. So does N equal the number of simultaneous loads and stores that can be issued per cycle, or is N equal to the number of outstanding loads and stores supported by the LSQ? Or does it equal something completely different?

Stores to different cache blocks can be issued to the memory system out-of-order and in parallel. Ruby already supports such functionality. The key is the store buffer must be drained in-order. It is up to the store buffer's functionality to get that right. Ruby can assist by providing interfaces for checking permission state and forwarding probes upstream, but it is up to the LSQ/store buffer to act appropriately and retry requests when necessary. I don't believe Ruby needs any fundamental changes to support x86-TSO. Instead, Ruby just needs to provide more information back to the LSQ.
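The discipline described here, issue in parallel, drain in order, can be sketched as below. This is a minimal model of a store buffer, not gem5 code; the names are invented for illustration.

```python
from collections import deque

class StoreBuffer:
    def __init__(self):
        self.buf = deque()           # entries in program order
        self.visible = []            # global visibility order

    def insert(self, addr):
        entry = {"addr": addr, "complete": False}
        self.buf.append(entry)
        return entry

    def ack(self, entry):
        # The memory system acknowledged write permission; acks may
        # arrive out of program order.
        entry["complete"] = True
        self._drain()

    def _drain(self):
        # Only the head may become visible, so visibility stays in
        # program order even when acks are not.
        while self.buf and self.buf[0]["complete"]:
            self.visible.append(self.buf.popleft()["addr"])


sb = StoreBuffer()
a, b = sb.insert(0xA), sb.insert(0xB)
sb.ack(b)    # the younger store completes first...
sb.ack(a)    # ...but 0xA still becomes visible before 0xB
```

Even with the acks reordered, `sb.visible` ends up in program order, which is the property Ruby is being asked to enable rather than enforce.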

Earlier I didn't notice that you also squash speculation on replacements, in addition to probes. Yeah, I think those changes take care of correctly squashing speculative loads. However, as I mentioned above, I still think we need to figure out how to provide the necessary information to allow stores to be issued in parallel, while still retiring in-order.

Implementing something similar to TSOtool would be great. However, I think there is benefit to do some quick tests using a DirectedTester before creating something like TSOtool.


- Brad


Nilay Vaish
2011-10-28 21:13:56 UTC
Permalink
Brad,

My understanding is that the LSQ can issue at most N loads and stores to
the memory system in each cycle.

For parallel stores, it seems that the core should have permissions for
these cache blocks all at the same time. Even if Ruby fetches coherence
permissions out-of-order, it would still have to ensure, for SC or TSO,
that stores that happened logically later in time become visible only
after all the earlier ones are visible to the rest of the system. As of now,
I disagree with the statement that --
'' Stores to different cache blocks can be issued to the
memory system out-of-order and in parallel ''
Unless we have some kind of guarantee on the order in which these stores
become visible to the rest of the system, I don't see how we can separate
out the memory system's behavior from the consistency model.

I was thinking of writing a tester that reads in a trace of memory operations
performed by a multi-processor system and the times at which these are performed.
Then we can check the load values against the expected load values. I think the
underlying assumption is that everything behaves in a deterministic fashion. What
do you think?
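A minimal sketch of this trace-driven checker, with an invented record format of (time, cpu, op, addr, value) and the deterministic-replay assumption stated above:

```python
def run_trace(trace):
    """trace: list of (time, cpu, op, addr, value); op in {'st', 'ld'}.
    For 'ld' records, value is the expected load result.
    Returns the list of mismatching loads (empty means the trace passed)."""
    mem = {}
    mismatches = []
    # Replay in timestamp order, assuming deterministic behavior.
    for time, cpu, op, addr, value in sorted(trace, key=lambda r: r[0]):
        if op == "st":
            mem[addr] = value
        else:                              # 'ld'
            observed = mem.get(addr, 0)    # untouched memory reads as 0
            if observed != value:
                mismatches.append((time, cpu, addr, value, observed))
    return mismatches


trace = [
    (10, 0, "st", 0x40, 1),
    (20, 1, "ld", 0x40, 1),    # expects to observe cpu0's store
    (30, 1, "ld", 0x80, 0),    # untouched address reads as 0
]
```

A real tester would drive the simulated memory system instead of the `mem` dict here, but the check, observed load values against expected ones, is the same.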


- Nilay


Brad Beckmann
2011-10-28 22:01:03 UTC
Permalink
Thanks for confirming the O3 LSQ requirement for N ports. I've got no further questions on that.

Stores can certainly be issued out-of-order in modern x86 processors. It is the store buffer's responsibility to ensure that stores become globally visible in program order. Maybe what you're getting at is that Ruby needs to support a two-phase store scheme so that the initial writeHitCallback supplies data to the CPU but does not update the L1 D cache block. I would agree to that. My point is that Ruby should only be responsible to provide the necessary information and interfaces to the LSQ logic. There is no reason to change the logic of Ruby's invalidation-based coherence protocols. It is the LSQ's (including store buffer) responsibility to ensure the correct order of store visibility.
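The two-phase scheme mentioned here could look roughly like the sketch below: phase one acquires write permission and notifies the CPU without touching the block's data; phase two writes the data when the store drains from the store buffer. The class and method names are invented, and the state handling is deliberately simplified to two outcomes.

```python
class L1Block:
    def __init__(self):
        self.state = "I"   # MESI-style coherence state
        self.data = 0

class TwoPhaseL1:
    def __init__(self):
        self.blocks = {}

    def request_write_permission(self, addr):
        # Phase 1: coherence transaction only; the data is untouched.
        # (This plays the role of a writeHitCallback that merely tells
        # the CPU it may proceed.)
        blk = self.blocks.setdefault(addr, L1Block())
        blk.state = "M"
        return True

    def commit_store(self, addr, value):
        # Phase 2: called when the store drains from the store buffer.
        blk = self.blocks[addr]
        if blk.state != "M":   # permission lost to a probe in the meantime:
            return False       # the LSQ/store buffer must reissue
        blk.data = value
        return True
```

The `False` return path is where the "retry requests when necessary" responsibility lands on the LSQ rather than on Ruby.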

Yes, your tester idea is essentially what I had in mind. The only thing I want to point out is that it may be beneficial to include both the time the request should issue and a delta for how long the request should be stalled in the mandatory queue. That way you can instigate races where younger memory ops deterministically bypass older ops.


- Brad


Beckmann, Brad
2011-10-29 00:24:13 UTC
Permalink
Let’s move this conversation to just the email thread.

I suspect we may be talking past each other, so let’s talk about the complete implementations, not just Ruby. There are multiple ways one can implement the store portion of x86-TSO. I’m not sure what the O3 model does, but here are a few possibilities:

- Do not issue any part of the store to the memory system when the instruction is executed. Instead, simply buffer it in the LSQ until the instruction retires, then move it to the store buffer after retirement. Only when the store reaches the head of the store buffer is it issued to Ruby. The next store is not issued to Ruby until the previous head-of-buffer store completes, maintaining correct store ordering.

- Do not issue any part of the store to the memory system when the instruction is executed. Instead, simply buffer it in the LSQ until the instruction retires. Once it retires and enters the store buffer, we issue the address request to Ruby (no L1 data update). Ruby forwards probes/replacements to the store buffer, and if the store buffer sees a probe/replacement to an address whose address request has already completed, the store buffer reissues the request. Once the store reaches the head of the store buffer, double-check with Ruby that write permissions still exist in the L1.

- Issue the store address (no L1 data update) to Ruby when the instruction is executed. When it retires, it enters the store buffer. Ruby forwards probes/replacements to the LSQ+store buffer, and if either sees a probe/replacement to an address whose address request has already completed, the request reissues (several policies exist on when to reissue the request). Once the store reaches the head of the store buffer, double-check with Ruby that write permissions still exist in the L1.

Do those scenarios make sense to you? I believe we can implement any one of them without modifying Ruby’s core functionality. If you are envisioning or if O3 implements something completely different, please let me know.

Brad
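The first of the schemes Brad lists is the simplest to picture: the store buffer issues only its head entry to the memory system, and the next store waits until the previous one completes. A hedged sketch, with all names invented for illustration:

```python
class InOrderIssueStoreBuffer:
    def __init__(self, memory):
        self.memory = memory     # dict standing in for the Ruby-backed memory
        self.pending = []        # program-order FIFO of (addr, value)
        self.outstanding = None  # at most one store in flight

    def retire_store(self, addr, value):
        # Store enters the store buffer at instruction retirement.
        self.pending.append((addr, value))
        self._try_issue()

    def _try_issue(self):
        # Only the head entry may issue, and only one at a time.
        if self.outstanding is None and self.pending:
            self.outstanding = self.pending.pop(0)

    def complete(self):
        # Completion callback from the memory system.
        addr, value = self.outstanding
        self.memory[addr] = value   # store becomes globally visible
        self.outstanding = None
        self._try_issue()           # now the next store may issue


mem = {}
sb5 = InOrderIssueStoreBuffer(mem)
sb5.retire_store(0xA, 1)
sb5.retire_store(0xB, 2)   # waits; 0xA is still outstanding
sb5.complete()             # 0xA becomes visible; 0xB now issues
sb5.complete()             # 0xB becomes visible
```

This scheme serializes stores completely, which is why the second and third schemes, issuing address requests early and relying on forwarded probes plus reissue, are attractive despite their extra complexity.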



Nilay Vaish
2011-10-31 22:42:56 UTC
Permalink
Let’s move this conversation to just the email thread.
- Do not issue any part of the store to the memory system when the instruction is executed. Instead, simply buffer it in the LSQ until the instruction retires, then move it to the store buffer after retirement. Only when the store reaches the head of the store buffer is it issued to Ruby. The next store is not issued to Ruby until the previous store completes, maintaining correct store ordering.
- Do not issue any part of the store to the memory system when the instruction is executed. Instead, simply buffer it in the LSQ until the instruction retires. Once it retires and enters the store buffer, issue the address request to Ruby (no L1 data update). Ruby forwards probes/replacements to the store buffer, and if the store buffer sees a probe/replacement to an address whose address request has already completed, the store buffer reissues the request. Once the store reaches the head of the store buffer, double-check with Ruby that write permission still exists in the L1.
- Issue the store address (no L1 data update) to Ruby when the instruction is executed. When it retires, it enters the store buffer. Ruby forwards probes/replacements to the LSQ+store buffer, and if either sees a probe/replacement to an address whose address request has already completed, the request reissues (several policies exist for when to reissue the request). Once the store reaches the head of the store buffer, double-check with Ruby that write permission still exists in the L1.
Do those scenarios make sense to you? I believe we can implement any one of them without modifying Ruby's core functionality. If you are envisioning something different, or if O3 implements something completely different, please let me know.
Brad
Brad, I will respond to your mail after I have sorted some of the other
stuff related to O3, X86, ...

--
Nilay
Nilay Vaish
2011-11-02 20:10:40 UTC
Permalink
Let’s move this conversation to just the email thread.
I suspect we may be talking past each other, so let's talk about the
complete implementations, not just Ruby. There are multiple ways one can
implement the store portion of x86-TSO. I'm not sure what the O3 model
implements.
- Do not issue any part of the store to the memory system when the
instruction is executed. Instead, simply buffer it in the LSQ until the
instruction retires, then move it to the store buffer after retirement.
Only when the store reaches the head of the store buffer is it issued to
Ruby. The next store is not issued to Ruby until the previous store
completes, maintaining correct store ordering.
- Do not issue any part of the store to the memory system when the
instruction is executed. Instead, simply buffer it in the LSQ until the
instruction retires. Once it retires and enters the store buffer, issue
the address request to Ruby (no L1 data update). Ruby forwards
probes/replacements to the store buffer, and if the store buffer sees a
probe/replacement to an address whose address request has already
completed, the store buffer reissues the request. Once the store reaches
the head of the store buffer, double-check with Ruby that write
permission still exists in the L1.
- Issue the store address (no L1 data update) to Ruby when the
instruction is executed. When it retires, it enters the store buffer.
Ruby forwards probes/replacements to the LSQ+store buffer, and if either
sees a probe/replacement to an address whose address request has already
completed, the request reissues (several policies exist for when to
reissue the request). Once the store reaches the head of the store
buffer, double-check with Ruby that write permission still exists in the
L1.
Do those scenarios make sense to you? I believe we can implement any one
of them without modifying Ruby's core functionality. If you are
envisioning something different, or if O3 implements something
completely different, please let me know.
1. What's the current memory model that the O3 CPU implements? Do we want
multiple memory models to co-exist? We might want to have both SC and TSO,
though Alpha had a weaker model.

2. I think we should try to stick with what the O3 CPU implements currently,
meaning we should not change the stage at which the store is issued to the
cache. I am more concerned about how multiple ports get handled.

--
Nilay
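For reference, the first proposal quoted above (issue stores one at a time from the store-buffer head) might be sketched as follows; StoreBuffer and FakeRuby are illustrative stand-ins, not gem5 code:

```python
# Illustrative sketch (not gem5 code) of the first proposal: retired
# stores drain from the store buffer one at a time, and the next store
# is not issued until the previous one completes.

from collections import deque

class FakeRuby:
    """Stand-in for the Ruby memory system."""
    def __init__(self):
        self.visible = []        # globally visible stores, in order
        self._pending = []       # issued but not yet completed
    def issue_store(self, addr, value, done):
        self._pending.append((addr, value, done))
    def complete_next(self):
        addr, value, done = self._pending.pop(0)
        self.visible.append((addr, value))
        done()                   # notify the store buffer

class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory
        self.queue = deque()     # retired stores, program order
        self.outstanding = None  # at most one store in flight

    def retire(self, addr, value):
        self.queue.append((addr, value))
        self._try_issue()

    def _try_issue(self):
        if self.outstanding is None and self.queue:
            self.outstanding = self.queue.popleft()
            addr, value = self.outstanding
            self.memory.issue_store(addr, value, self._complete)

    def _complete(self):
        # Only once the in-flight store is globally visible may the
        # next one issue, so stores become visible in program order.
        self.outstanding = None
        self._try_issue()

ruby = FakeRuby()
sb = StoreBuffer(ruby)
sb.retire(0xA, 1)
sb.retire(0xB, 2)               # queued: only one store may be in flight
assert len(ruby._pending) == 1
ruby.complete_next()            # first completes, second issues
ruby.complete_next()
assert ruby.visible == [(0xA, 1), (0xB, 2)]   # program order preserved
```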
Nilay Vaish
2011-11-09 02:12:04 UTC
Permalink
Looking at the trace generated by the toy application I use for testing
the O3 CPU and Ruby combination, I have been able to confirm my suspicion
that stores can become visible to the rest of the system in an order
different from the program order.

It might be that the classic memory system does not allow stores to go out
of order. Or the initial implementation of the O3 CPU may have targeted a
weaker memory model, like that of the Alpha architecture (Prof. Hill
suggested that this might be the case).

Overall, I am still not clear on how to make O3 and Ruby work together
correctly for SC or TSO in the case where multiple stores can be issued to
the memory system in parallel.

--
Nilay
Beckmann, Brad
2011-11-09 06:44:30 UTC
Permalink
Hi Nilay,

With regards to your question about how to allow multiple simultaneous stores, do you not believe my second and third proposals achieve that?

As I stated before, I don't think we need to make any fundamental changes to Ruby. We just need to provide the correct information and interfaces to the LSQ/Store Buffer.

Brad
-----Original Message-----
Sent: Tuesday, November 08, 2011 6:12 PM
To: Beckmann, Brad
Cc: Default; Mark D. Hill
Subject: RE: Review Request: Forward invalidations from Ruby to O3 CPU
Nilay Vaish
2011-11-09 18:15:08 UTC
Permalink
Brad,

As long as we use multiple ports only to fetch coherence permissions and
only one store is performed at a time, it is intuitively clear to me that
SC and TSO can be implemented. But if we implement this, it might mean
forgoing the Alpha-like memory model that we have in place right now. This
goes back to my earlier question of which memory model(s) we are interested
in. Do we prefer co-existence of multiple memory models?

--
Nilay
Beckmann, Brad
2011-11-09 20:26:06 UTC
Permalink
I see. It sounds like you're still worried about how the RubyPort can support multiple M5 cpu ports and still adhere to a stronger consistency model. Sorry for not directly responding to that question earlier, but to me that seems like an orthogonal issue that you've already solved. If I recall correctly, the patch you sent out for review essentially attaches the multiple M5 cpu ports, representing simultaneous cpu requests, to the single RubyPort that represents the CPU's connection to the L1 caches. That seems reasonable to me and I don't see any problem with it. The key is that the cpu LSQ cannot blindly issue simultaneous requests to the memory system without expecting, and acting upon, probes that occur between issue and retirement. Furthermore, the CPU needs to communicate to Ruby when the instructions associated with the memory operations retire (for loads) or reach the head of the store buffer (for stores). Once Ruby receives that notification, it can stop monitoring that location and move the cache block to a base state.

Now to answer your specific question: We are definitely interested in a TSO model and in my opinion that is the only consistency model that we have to implement. Remember TSO is a valid implementation of Alpha's or ARM's weaker models. We can certainly implement subsequent models, but that should not be a short term goal.

I know this can be a complicated subject, so please send me questions if you disagree or are confused. I certainly may be overlooking something, and my thoughts are constantly evolving as I page more of this back into memory. For instance, I realize that my previous mail was incorrect because I confused the LSQ, which contains pre-retirement memory instructions, with the store buffer, which contains post-retirement store values. If a probe hits in the store buffer, the CPU doesn't (it can't) reissue the store instruction; the store buffer shields the CPU from that probe. As long as the cache has write permission when the store reaches the head of the store buffer, stores have a global order and TSO is maintained. Of course, probing of loads in the LSQ also needs to occur, along with several other features for supporting locks, fences, etc.

If you do have further questions, please be as specific as possible. It is hard to talk about this subject using generalities.

Brad
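A toy sketch of the probe-handling distinction described above (a probe hitting a post-retirement store in the store buffer is absorbed, while a probe hitting a completed but not-yet-retired load in the LSQ forces a squash and replay); the structure layout and action names are hypothetical, not O3's actual API:

```python
# Hypothetical sketch of the probe-handling split: the store buffer
# shields the CPU, while completed pre-retirement loads must replay.

def handle_probe(addr, lsq_loads, store_buffer):
    """Actions triggered when an invalidation probe for `addr` arrives.

    lsq_loads: list of (addr, completed) pre-retirement loads
    store_buffer: addresses of post-retirement stores
    """
    actions = []
    for load_addr, completed in lsq_loads:
        if load_addr == addr and completed:
            # The load already obtained its value; an intervening
            # invalidation means that value may violate the consistency
            # order, so the load (and younger instructions) must replay.
            actions.append("squash_and_replay_load")
    if addr in store_buffer:
        # The store buffer shields the CPU: the store cannot reissue,
        # so the probe is absorbed and write permission is simply
        # re-acquired when the store reaches the buffer head.
        actions.append("absorb_in_store_buffer")
    return actions

print(handle_probe(0x40, [(0x40, True), (0x80, False)], [0x40]))
# -> ['squash_and_replay_load', 'absorb_in_store_buffer']
```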
Nilay Vaish
2011-11-09 23:25:53 UTC
Permalink
Brad, your reply clears the air somewhat.

The current patch allows us to use the existing O3 CPU with Ruby. Since
the O3 CPU already provides Alpha's memory model, we get that for free.
Now that we would also like to have TSO, we need to work out how the two
models would co-exist. I'll think more about this, but we need a broader
consensus on it.

--
Nilay
Steve Reinhardt
2011-11-10 17:25:13 UTC
Permalink
It sounds like you guys are doing a good job of working this out, but I
have a few comments. Sorry not to jump in sooner.

- It's absolutely true that O3 was written for Alpha and as such did not
have to worry about the CPU model enforcing any memory orderings beyond the
explicit barrier/fence ops. It's no surprise that O3 needs to be modified
to support stronger consistency models.

- While TSO is a valid implementation of the Alpha memory model, we should
not unnecessarily restrict performance by constraining memory order. Note
that even though Alpha is not used much, we have other ISAs (most notably
ARM but also Power) that have weak memory models. In the near term it's
fine to just work on implementing TSO without considering Alpha, but for
the final commit it would be good to find a minimal set of changes that
enforce TSO and condition them on the ISA being x86 (or if we want to get
fancy, we could introduce a "memory consistency model" flag and set it to
TSO when the ISA is x86).

- Since we need to implement some consistency mechanism in O3 more or less
from scratch, I suggest we do a reasonably aggressive mechanism that
corresponds most closely with what modern processors do, without being
overly complicated. O3 is not intended to be an extremely accurate model
of any particular modern CPU, but we don't want to create unnecessary
differences between its behavior and that of a typical modern CPU either.

- If we need to make some changes in the Port interface to make this work
well, that's OK. Someday I would still like to see Port and RubyPort
integrated so we don't have to do a translation between the two structs on
every memory access. That probably doesn't affect this directly, but it's
good to keep in mind as we evolve the code.

Steve
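The "memory consistency model" flag suggested above could look roughly like this in a configuration script; the function name and flag values are hypothetical sketches, not an existing gem5 option:

```python
# Sketch of the proposed "memory consistency model" flag, defaulted from
# the ISA. The names here are hypothetical, not real gem5 parameters.

def default_consistency_model(isa):
    # TSO is required for x86 correctness; weakly ordered ISAs keep the
    # existing relaxed behavior (for them TSO would still be a valid,
    # if conservative, implementation).
    return "TSO" if isa == "x86" else "weak"

for isa in ("x86", "alpha", "arm", "power"):
    print(isa, "->", default_consistency_model(isa))
```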
Post by Nilay Vaish
Brad, your reply clears some air.
The current patch allows us to use the existing O3 CPU with Ruby. Since
the O3 CPU already provides Alpha's memory model, we get that for free. Now
that we would like to have TSO as well, we need to work out how the two
models would co-exist. I'll think more about this, but we need a broader
consensus on this.
--
Nilay
I see. It sounds like you're still worried about how the RubyPort can
Post by Beckmann, Brad
support multiple M5 cpu ports and still adhere to a stronger consistency
model. Sorry for not directly responding to that question earlier, but to
me that seems like an orthogonal issue that you've already solved. If I
recall correctly, the patch you sent out for review essentially attaches
the multiple M5 cpu ports, representing simultaneous cpu requests, to the
single RubyPort that represents the CPUs connection to the L1 caches. That
seems reasonable to me and I don't see any problem with it. The key is
that the cpu LSQ cannot blindly issue simultaneous requests to the memory
system without expecting and acting upon probes that occur between issue
and retirement. Furthermore, the CPU needs to communicate to Ruby when the
instructions associated with the memory operations retire (for loads) or
reach the head of the store buffer (for stores). Once Ruby receives that
notification, it can stop monitoring that location and move the cache block
to a base state.
Now to answer your specific question: We are definitely interested in a
TSO model and in my opinion that is the only consistency model that we have
to implement. Remember TSO is a valid implementation of Alpha's or ARM's
weaker models. We can certainly implement subsequent models, but that
should not be a short term goal.
I know this can be a complicated subject so please send me questions if
you disagree or are confused. I certainly may be overlooking something and
my thoughts are constantly evolving as well as I page more of this into my
memory. For instance, I realize that my previous mail was incorrect
because I confused the LSQ, which contains pre-retirement memory
instructions, with the store buffer, which contains post-retirement store
instruction values. If a probe hits in the store buffer, the CPU doesn't
(it can't) reissue the store instruction. The store buffer shields the CPU
from that probe. As long as the cache has write permission when the store
reaches the head of the store buffer, stores have a global order and TSO is
maintained. Of course probing loads in the LSQ also needs to occur, along
with several other features for supporting locks, fences, etc.
If you do have further questions, please be as specific as possible. It is
hard to discuss this subject in generalities.
Brad
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Nilay Vaish
2011-11-11 04:12:06 UTC
Permalink
Post by Steve Reinhardt
It sounds like you guys are doing a good job of working this out, but I
have a few comments. Sorry not to jump in sooner.
- It's absolutely true that O3 was written for Alpha and as such did not
have to worry about the CPU model enforcing any memory orderings beyond the
explicit barrier/fence ops. It's no surprise that O3 needs to be modified
to support stronger consistency models.
- While TSO is a valid implementation of the Alpha memory model, we should
not unnecessarily restrict performance by constraining memory order. Note
that even though Alpha is not used much, we have other ISAs (most notably
ARM but also Power) that have weak memory models. In the near term it's
fine to just work on implementing TSO without considering Alpha, but for
the final commit it would be good to find a minimal set of changes that
enforce TSO and condition them on the ISA being x86 (or if we want to get
fancy, we could introduce a "memory consistency model" flag and set it to
TSO when the ISA is x86).
I think the minimal change is to allow only one store to be in flight.
This means that once a store is issued to the memory system, the load/store
queue waits until that store completes before issuing another store. The
load/store queue already forwards values from prior stores to later loads.
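The one-store-in-flight rule can be sketched as a gate on the store queue. This is a toy model with invented names, not the O3 LSQ: the head store is issued only when no store is outstanding, so stores drain to memory strictly one at a time, in program order.

```python
# Toy sketch of the one-store-in-flight rule; names are illustrative.

class StoreQueue:
    def __init__(self):
        self.pending = []        # committed stores, in program order
        self.in_flight = None    # at most one outstanding store
        self.completed = []

    def try_issue(self):
        """Issue the head store only if nothing is outstanding."""
        if self.in_flight is None and self.pending:
            self.in_flight = self.pending.pop(0)
            return True
        return False             # wait for the outstanding store

    def mem_ack(self):
        """Memory system signals that the outstanding store is done."""
        self.completed.append(self.in_flight)
        self.in_flight = None
```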
Post by Steve Reinhardt
- Since we need to implement some consistency mechanism in O3 more or less
from scratch, I suggest we do a reasonably aggressive mechanism that
corresponds most closely with what modern processors do, without being
overly complicated. O3 is not intended to be an extremely accurate model
of any particular modern CPU, but we don't want to create unnecessary
differences between its behavior and that of a typical modern CPU either.
We need to decide on this aggressive mechanism. Brad outlined several in
one of his previous emails. The approach described above is essentially
the first proposal that Brad suggested.
Post by Steve Reinhardt
- If we need to make some changes in the Port interface to make this work
well, that's OK. Someday I would still like to see Port and RubyPort
integrated so we don't have to do a translation between the two structs on
every memory access. That probably doesn't affect this directly, but it's
good to keep in mind as we evolve the code.
--
Nilay
Steve Reinhardt
2011-11-11 05:20:12 UTC
Permalink
Sorry, I didn't phrase that quite right... I was not implying that we
should make the minimal set of changes necessary to implement TSO. I meant
that once we make whatever set of changes we want to make for TSO
(preferably a reasonably aggressive one), then we should find a minimal
subset of those changes to condition on the ISA.
Right, it looks like Brad's proposals were in order of aggressiveness, and
I was thinking that we should probably go for #2 or #3.

Steve
Nilay Vaish
2011-11-11 18:08:54 UTC
Permalink
I am thinking of implementing a modified version of Brad's second proposal.
Each store accesses the cache at most twice. First, when the store is
committed, one of the following cases applies --
* the store is at the head of the store buffer: issue a write request to
the memory system.
* the store is not at the head of the store buffer: if the architecture's
memory model is TSO, issue a read-exclusive request to the memory system;
under a relaxed model, issue a write request instead.

Once the store reaches the head of the store buffer, the actual write is
issued in the TSO case. Even if the store buffer receives an invalidation
for an address for which a store exists in the buffer, the read-exclusive
request is not issued again.

--
Nilay
Nilay Vaish
2011-11-14 22:27:10 UTC
Permalink
I am thinking of implementing a modified version of Brad's second proposal.
Each store accesses the cache at most twice. First, when the store is
committed, one of the following cases applies --
* the store is at the head of the store buffer: issue a write request to
the memory system.
* the store is not at the head of the store buffer: under TSO, issue a
read-exclusive request to the memory system; under a relaxed model, issue a
write request instead.
Once the store reaches the head of the store buffer, the actual write is
issued in the TSO case. Even if the store buffer receives an invalidation
for an address for which a store exists in the buffer, the read-exclusive
request is not issued again.
Before starting the implementation of this scheme, I had implemented the
scheme in which only the store at the head of the store queue is allowed
to be in flight.

For the past two or three days, I have been trying to write the code for
the scheme described above. As of now, it feels like I am writing this
code by debugging the errors I observe when running the simulation.
Given the complexity of the code, adding this feature would only
complicate it further. And given that we want TSO and Alpha's memory
model to co-exist, it seems to me that we would be making this code a lot
more complex than it is today.

My opinion is that we should go with the simple implementation, rather
than going for the aggressive approach and complicating the code further.

--
Nilay
Korey Sewell
2011-11-15 04:57:04 UTC
Permalink
Hey Nilay,
I just wanted to say another "good work" to you for tackling this issue. I
am looking forward to seeing an updated patch here. After the ISCA
deadline, I'd like to focus on this issue more and help where I can (it's
pretty important to my future work).

I've admittedly lost touch with some of the specifics of this thread, but I
agree with you that the best path is probably to get the simpler version
working first (stores issued only from the head of the store buffer), since
that is a subset of the more complicated version anyway (which also covers
stores that are not at the head of the store buffer).

Right?
--
- Korey
Nilay Vaish
2011-10-31 01:11:26 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2011-10-30 18:11:26.652931)


Review request for Default.


Summary (updated)
-------

[mq]: RubyO3.1.patch


Diffs (updated)
-----

configs/example/se.py 5fbc6bd9de63
configs/ruby/MESI_CMP_directory.py 5fbc6bd9de63
src/mem/protocol/MESI_CMP_directory-L1cache.sm 5fbc6bd9de63
src/mem/protocol/RubySlicc_Types.sm 5fbc6bd9de63
src/mem/ruby/system/RubyPort.hh 5fbc6bd9de63
src/mem/ruby/system/RubyPort.cc 5fbc6bd9de63
src/mem/ruby/system/Sequencer.hh 5fbc6bd9de63
src/mem/ruby/system/Sequencer.cc 5fbc6bd9de63

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Nilay Vaish
2011-10-31 01:15:09 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2011-10-30 18:15:09.718146)


Review request for Default.


Summary (updated)
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
I believe is the LSQ in the case of the O3 CPU. Those who understand the O3
LSQ should take a close look at the implementation and determine (at least
qualitatively) whether something is missing or erroneous.

This patch only modifies the MESI CMP directory protocol. I will modify other
protocols once we sort the major issues surrounding this patch.

My understanding is that this should ensure an SC execution, as long as Ruby
can support SC. But I think Ruby does not support any memory model currently.
A couple of issues that need discussion --

* Can this get into a deadlock? A CPU may not be able to proceed if a
particular cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How do we ensure that at least one
instruction is retired before an invalidation/replacement is processed?

* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or those
present in manuals from AMD and Intel? I have tested that Ruby will forward
the invalidations, but not the part where the LSQ needs to act on it.


Diffs (updated)
-----

configs/example/se.py 5fbc6bd9de63
configs/ruby/MESI_CMP_directory.py 5fbc6bd9de63
src/mem/protocol/MESI_CMP_directory-L1cache.sm 5fbc6bd9de63
src/mem/protocol/RubySlicc_Types.sm 5fbc6bd9de63
src/mem/ruby/system/RubyPort.hh 5fbc6bd9de63
src/mem/ruby/system/RubyPort.cc 5fbc6bd9de63
src/mem/ruby/system/Sequencer.hh 5fbc6bd9de63
src/mem/ruby/system/Sequencer.cc 5fbc6bd9de63

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Nilay Vaish
2011-10-31 01:16:06 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2011-10-30 18:16:06.487162)


Review request for Default.


Summary
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
I believe is the LSQ in the case of the O3 CPU. Those who understand the O3
LSQ should take a close look at the implementation and determine (at least
qualitatively) whether something is missing or erroneous.

This patch only modifies the MESI CMP directory protocol. I will modify other
protocols once we sort the major issues surrounding this patch.

My understanding is that this should ensure an SC execution, as long as Ruby
can support SC. But I think Ruby does not support any memory model currently.
A couple of issues that need discussion --

* Can this get into a deadlock? A CPU may not be able to proceed if a
particular cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How do we ensure that at least one
instruction is retired before an invalidation/replacement is processed?

* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or those
present in manuals from AMD and Intel? I have tested that Ruby will forward
the invalidations, but not the part where the LSQ needs to act on it.


Diffs (updated)
-----

configs/example/se.py dd77c8d0a93e
configs/ruby/MESI_CMP_directory.py dd77c8d0a93e
src/mem/protocol/MESI_CMP_directory-L1cache.sm dd77c8d0a93e
src/mem/protocol/RubySlicc_Types.sm dd77c8d0a93e
src/mem/ruby/system/RubyPort.hh dd77c8d0a93e
src/mem/ruby/system/RubyPort.cc dd77c8d0a93e
src/mem/ruby/system/Sequencer.hh dd77c8d0a93e
src/mem/ruby/system/Sequencer.cc dd77c8d0a93e

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Nilay Vaish
2011-11-15 14:53:48 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2011-11-15 06:53:48.605157)


Review request for Default.


Summary
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
I believe is the LSQ in the case of the O3 CPU. Those who understand the O3
LSQ should take a close look at the implementation and determine (at least
qualitatively) whether something is missing or erroneous.

This patch only modifies the MESI CMP directory protocol. I will modify other
protocols once we sort the major issues surrounding this patch.

My understanding is that this should ensure an SC execution, as long as Ruby
can support SC. But I think Ruby does not support any memory model currently.
A couple of issues that need discussion --

* Can this get into a deadlock? A CPU may not be able to proceed if a
particular cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How do we ensure that at least one
instruction is retired before an invalidation/replacement is processed?

* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or those
present in manuals from AMD and Intel? I have tested that Ruby will forward
the invalidations, but not the part where the LSQ needs to act on it.


Diffs (updated)
-----

configs/example/se.py e66a566f2cfa
configs/ruby/MESI_CMP_directory.py e66a566f2cfa
src/mem/protocol/MESI_CMP_directory-L1cache.sm e66a566f2cfa
src/mem/protocol/RubySlicc_Types.sm e66a566f2cfa
src/mem/ruby/system/RubyPort.hh e66a566f2cfa
src/mem/ruby/system/RubyPort.cc e66a566f2cfa
src/mem/ruby/system/Sequencer.hh e66a566f2cfa
src/mem/ruby/system/Sequencer.cc e66a566f2cfa

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Brad Beckmann
2011-11-15 22:44:00 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1655
-----------------------------------------------------------


Nilay,

Thanks for pushing this patch along. This is a very important feature for gem5 and I'm glad we have you working on it.

First, to answer your questions:
- We can certainly avoid deadlock, but exactly how we do it depends on the interactions between the O3 CPU and Ruby. For the most part, it is up to the O3 model to avoid deadlock. I've heard through the grapevine that you are thinking about implementing the first, simplest option I suggested in my previous email. Essentially that is the one where the O3 model doesn't issue stores to Ruby until they reach the head of the store buffer. I think that is an excellent choice and it avoids having to worry about deadlock for stores since they are only issued to the memory system once they become non-speculative. In contrast, I'm sure the O3 model will issue speculative loads to Ruby and if the O3 CPU relies on speculative loads to succeed, we will encounter deadlock. However, as long as the
O3 model eventually issues a load non-speculatively, I'm pretty sure we can guarantee forward progress. Make sense?
- Testing at the CPU model is a great question. Do you know if the O3 model can read in a trace? If so, I would suggest a solution similar to the trace-based one I suggested before for testing Ruby. Basically you need the trace entries to include a fixed delay so that you can enforce certain reorderings. I would use those fixed delay values to manipulate the delay in the mandatory queue.
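One way to picture the trace idea is below. The entry format is entirely made up for illustration: each trace entry carries a fixed delay, and ordering requests by issue tick plus delay emulates manipulating the delay in the mandatory queue to force a chosen reordering.

```python
# Toy model of a delay-annotated memory trace; the format is invented.

def schedule(trace):
    """trace: list of (name, issue_tick, delay) tuples.
    Returns request names in the order they reach the mandatory queue,
    i.e. sorted by issue_tick + delay (ties broken by issue_tick)."""
    return [name for name, tick, delay in
            sorted(trace, key=lambda e: (e[1] + e[2], e[1]))]
```

Here a fixed 10-tick delay on an earlier load pushes it behind a later store, the kind of reordering a consistency litmus test would want to provoke.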

A couple questions/comments:
- Why do you say that "My understanding is that this should ensure an SC execution, as long as Ruby can support SC. But I think Ruby does not support any memory model currently"? Ruby implements a cache coherence protocol, which is a component of a memory model, but in itself is not a memory model. Ruby can't alone support any particular memory model. However, I believe by forwarding probes and evictions to the CPU, Ruby can help support SC, TSO, or any other memory model. It is up to the CPU to act appropriately to achieve a certain model.
- I would modify the action name "cc_squash_speculation" to something like "forward_eviction_to_cpu". It is really up to the CPU and the memory model to determine whether speculation should be squashed. We should not imply that Ruby is designed to support a specific memory model or CPU type.
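The forwarding path the patch adds can be sketched as a simple fan-out. This is a toy model with invented names (only "forward_eviction_to_cpu" echoes the rename suggested above, not an actual gem5 interface): on an invalidation or replacement, the L1's RubyPort pushes the eviction to every attached CPU-side port, and the CPU decides what, if anything, to squash.

```python
# Toy sketch of eviction fan-out from a RubyPort to its CPU ports;
# class names are illustrative, not the actual gem5 interfaces.

class CpuPort:
    def __init__(self):
        self.evictions = []      # addresses the CPU must react to

class RubyPort:
    def __init__(self):
        self.cpu_ports = []      # the list of ports the patch adds

    def forward_eviction_to_cpu(self, addr):
        # Invalidation/replacement at the L1: tell every attached port.
        for port in self.cpu_ports:
            port.evictions.append(addr)
```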



- Brad
Post by Nilay Vaish
-----------------------------------------------------------
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------
(Updated 2011-11-15 06:53:48)
Review request for Default.
Summary
-------
O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
I believe is the LSQ in case of the O3 CPU. Those who understand the O3 LSQ
should take a close look at the implementation and figure out (at least
qualitatively) if some thing is missing or erroneous.
This patch only modifies the MESI CMP directory protocol. I will modify other
protocols once we sort the major issues surrounding this patch.
My understanding is that this should ensure an SC execution, as long as Ruby
can support SC. But I think Ruby does not support any memory model currently.
A couple of issues that need discussion --
* Can this get in to a deadlock? A CPU may not be able to proceed if a
particularly cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How do we ensure that at least one
instruction is retired before an invalidation/replacement is processed?
* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or those
present in manuals from AMD and Intel? I have tested that Ruby will forward
the invalidations, but not the part where the LSQ needs to act on it.
Diffs
-----
configs/example/se.py e66a566f2cfa
configs/ruby/MESI_CMP_directory.py e66a566f2cfa
src/mem/protocol/MESI_CMP_directory-L1cache.sm e66a566f2cfa
src/mem/protocol/RubySlicc_Types.sm e66a566f2cfa
src/mem/ruby/system/RubyPort.hh e66a566f2cfa
src/mem/ruby/system/RubyPort.cc e66a566f2cfa
src/mem/ruby/system/Sequencer.hh e66a566f2cfa
src/mem/ruby/system/Sequencer.cc e66a566f2cfa
Diff: http://reviews.m5sim.org/r/894/diff
Testing
-------
Thanks,
Nilay
Nilay Vaish
2011-11-15 23:37:47 UTC
Permalink
Brad, those questions that appear in the description of the patch have been
there since I first posted the patch. In fact we did discuss those questions
earlier as well.

* So how do we ensure that at least one load is committed between successive
invalidations?

* I'll change the name of the function in the protocol.


- Nilay


Beckmann, Brad
2011-11-15 23:50:56 UTC
Permalink
Sorry for the confusion. Could you update the comment on the patch so that I don’t make that mistake again? Is that possible?

I don’t think Ruby needs to ensure that at least one load is committed between invalidations. All Ruby cache coherence protocols ensure that any particular load operation eventually completes. The CPU needs to make sure that it eventually issues loads non-speculatively, so that when Ruby completes the load, the instruction retires and forward progress is maintained.

Brad


From: Nilay Vaish [mailto:***@cs.wisc.edu]
Sent: Tuesday, November 15, 2011 3:38 PM
To: Nilay Vaish; Beckmann, Brad; Default
Subject: Re: Review Request: O3, Ruby: Forward invalidations from Ruby to O3 CPU

This is an automatically generated e-mail. To reply, visit: http://reviews.m5sim.org/r/894/



On November 15th, 2011, 2:44 p.m., Brad Beckmann wrote:

Nilay,



Thanks for pushing this patch along. This is a very important feature for gem5 and I'm glad we have you working on it.



First, to answer your questions:

- We can certainly avoid deadlock, but exactly how we do it depends on the interactions between the O3 CPU and Ruby. For the most part, it is up to the O3 model to avoid deadlock. I've heard through the grapevine that you are thinking about implementing the first, simplest option I suggested in my previous email. Essentially that is the one where the O3 model doesn't issue stores to Ruby until they reach the head of the store buffer. I think that is an excellent choice and it avoids having to worry about deadlock for stores since they are only issued to the memory system once they become non-speculative. In contrast, I'm sure the O3 model will issue speculative loads to Ruby and if the O3 CPU relies on speculative loads to succeed, we will encounter deadlock. However, as long as the O3 model eventually issues a load non-speculatively, I'm pretty sure we can guarantee forward progress. Make sense?

- Testing at the CPU model is a great question. Do you know if the O3 model can read in a trace? If so, I would suggest a solution similar to the trace solution I suggested before to test Ruby. Basically you need the trace entries include a fixed delay so that you can enforce certain reorderings. I would use those fixed delay values to manipulate the delay in the mandatory queue.



A couple questions/comments:

- Why do you say that "My understanding is that this should ensure an SC execution, as long as Ruby can support SC. But I think Ruby does not support any memory model currently"? Ruby implements a cache coherence protocol, which is a component of a memory model, but in itself is not a memory model. Ruby can't alone support any particular memory model. However, I believe by forwarding probes and evictions to the CPU, Ruby can help support SC, TSO, or any other memory model. It is up to the CPU to act appropriately to achieve a certain model.

- I would modify the action name "cc_squash_speculation" to something like "forward_eviction_to_cpu". It is really up to the CPU and memory model to determine whether speculation should be squashed. We should not try to imply that Ruby is designed to support a specific memory model or CPU type.
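On the RubyPort side, the renamed hook would amount to a policy-free fan-out of the evicted address to every CPU-side port, as sketched below. These are illustrative types, not the actual RubyPort classes: the point is that Ruby just forwards the address and leaves any squashing decision to the receiver.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a CPU-side port (e.g. one feeding the O3 LSQ); it simply
// records the evictions it is told about.
struct CpuPort {
    std::vector<uint64_t> evictionsSeen;
    void sendEviction(uint64_t addr) { evictionsSeen.push_back(addr); }
};

// Sketch of the eviction callback the patch adds to RubyPort: forward
// the evicted block's address to all registered CPU-side ports, with no
// assumption about what the CPU does in response.
struct RubyPortSketch {
    std::vector<CpuPort *> cpuPorts;
    void evictionCallback(uint64_t addr) {
        for (CpuPort *p : cpuPorts)
            p->sendEviction(addr);
    }
};
```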



Brad, those questions that appear in the description of the patch have been there since I first posted the patch. In fact we did discuss those questions earlier as well.



* So how do we ensure that at least one load is committed between successive invalidations?



* I'll change the name of the function in the protocol.


- Nilay


On November 15th, 2011, 6:53 a.m., Nilay Vaish wrote:
Review request for Default.
By Nilay Vaish.

Updated 2011-11-15 06:53:48

Description

O3, Ruby: Forward invalidations from Ruby to O3 CPU

This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
I believe is the LSQ in case of the O3 CPU. Those who understand the O3 LSQ
should take a close look at the implementation and figure out (at least
qualitatively) whether something is missing or erroneous.

This patch only modifies the MESI CMP directory protocol. I will modify other
protocols once we sort out the major issues surrounding this patch.

My understanding is that this should ensure an SC execution, as long as Ruby
can support SC. But I think Ruby does not support any memory model currently.
A couple of issues that need discussion --

* Can this get into a deadlock? A CPU may not be able to proceed if a
particular cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How do we ensure that at least one
instruction is retired before an invalidation/replacement is processed?

* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or those
present in manuals from AMD and Intel? I have tested that Ruby will forward
the invalidations, but not the part where the LSQ needs to act on it.


Diffs

* configs/example/se.py (e66a566f2cfa)
* configs/ruby/MESI_CMP_directory.py (e66a566f2cfa)
* src/mem/protocol/MESI_CMP_directory-L1cache.sm (e66a566f2cfa)
* src/mem/protocol/RubySlicc_Types.sm (e66a566f2cfa)
* src/mem/ruby/system/RubyPort.hh (e66a566f2cfa)
* src/mem/ruby/system/RubyPort.cc (e66a566f2cfa)
* src/mem/ruby/system/Sequencer.hh (e66a566f2cfa)
* src/mem/ruby/system/Sequencer.cc (e66a566f2cfa)

Diff: http://reviews.m5sim.org/r/894/diff
Nilay Vaish
2011-11-18 23:28:45 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2011-11-18 15:28:45.145836)


Review request for Default.


Summary
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
I believe is the LSQ in case of the O3 CPU. Those who understand the O3 LSQ
should take a close look at the implementation and figure out (at least
qualitatively) whether something is missing or erroneous.

This patch only modifies the MESI CMP directory protocol. I will modify other
protocols once we sort out the major issues surrounding this patch.

My understanding is that this should ensure an SC execution, as long as Ruby
can support SC. But I think Ruby does not support any memory model currently.
A couple of issues that need discussion --

* Can this get into a deadlock? A CPU may not be able to proceed if a
particular cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How do we ensure that at least one
instruction is retired before an invalidation/replacement is processed?

* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or those
present in manuals from AMD and Intel? I have tested that Ruby will forward
the invalidations, but not the part where the LSQ needs to act on it.


Diffs (updated)
-----

configs/example/se.py 330f8109b199
configs/ruby/MESI_CMP_directory.py 330f8109b199
configs/ruby/MI_example.py 330f8109b199
configs/ruby/MOESI_CMP_directory.py 330f8109b199
configs/ruby/MOESI_CMP_token.py 330f8109b199
configs/ruby/MOESI_hammer.py 330f8109b199
src/mem/protocol/MESI_CMP_directory-L1cache.sm 330f8109b199
src/mem/protocol/MI_example-cache.sm 330f8109b199
src/mem/protocol/MOESI_CMP_directory-L1cache.sm 330f8109b199
src/mem/protocol/MOESI_CMP_token-L1cache.sm 330f8109b199
src/mem/protocol/MOESI_hammer-cache.sm 330f8109b199
src/mem/protocol/RubySlicc_Types.sm 330f8109b199
src/mem/ruby/system/RubyPort.hh 330f8109b199
src/mem/ruby/system/RubyPort.cc 330f8109b199
src/mem/ruby/system/Sequencer.hh 330f8109b199
src/mem/ruby/system/Sequencer.cc 330f8109b199

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Nilay Vaish
2011-11-18 23:30:25 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1664
-----------------------------------------------------------


- Nilay
Nilay Vaish
2011-11-18 23:33:39 UTC
Permalink
The patch has been changed so that evictions are forwarded
by all the coherence protocols.


- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1664
-----------------------------------------------------------
Nilay Vaish
2011-11-18 23:33:07 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2011-11-18 15:33:07.166356)


Review request for Default.


Summary (updated)
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
I believe is the LSQ in case of the O3 CPU. Those who understand the O3 LSQ
should take a close look at the implementation and figure out (at least
qualitatively) whether something is missing or erroneous.

A couple of issues that need discussion --

* How to avoid deadlock in the O3 CPU? The CPU may not be able to proceed if a
particular cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How do we ensure that at least one
instruction is retired before an invalidation/replacement is processed?

* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or those
present in manuals from AMD and Intel? I have tested that Ruby will forward
the invalidations, but not the part where the LSQ needs to act on it.


Diffs
-----

configs/example/se.py 330f8109b199
configs/ruby/MESI_CMP_directory.py 330f8109b199
configs/ruby/MI_example.py 330f8109b199
configs/ruby/MOESI_CMP_directory.py 330f8109b199
configs/ruby/MOESI_CMP_token.py 330f8109b199
configs/ruby/MOESI_hammer.py 330f8109b199
src/mem/protocol/MESI_CMP_directory-L1cache.sm 330f8109b199
src/mem/protocol/MI_example-cache.sm 330f8109b199
src/mem/protocol/MOESI_CMP_directory-L1cache.sm 330f8109b199
src/mem/protocol/MOESI_CMP_token-L1cache.sm 330f8109b199
src/mem/protocol/MOESI_hammer-cache.sm 330f8109b199
src/mem/protocol/RubySlicc_Types.sm 330f8109b199
src/mem/ruby/system/RubyPort.hh 330f8109b199
src/mem/ruby/system/RubyPort.cc 330f8109b199
src/mem/ruby/system/Sequencer.hh 330f8109b199
src/mem/ruby/system/Sequencer.cc 330f8109b199

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Brad Beckmann
2011-11-18 23:35:26 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1666
-----------------------------------------------------------



src/mem/protocol/MESI_CMP_directory-L1cache.sm
<http://reviews.m5sim.org/r/894/#comment2157>

Change name to send_evictions



src/mem/protocol/MI_example-cache.sm
<http://reviews.m5sim.org/r/894/#comment2158>

Same here



src/mem/protocol/MOESI_CMP_directory-L1cache.sm
<http://reviews.m5sim.org/r/894/#comment2159>

and here



src/mem/protocol/MOESI_CMP_token-L1cache.sm
<http://reviews.m5sim.org/r/894/#comment2160>

here



src/mem/protocol/MOESI_hammer-cache.sm
<http://reviews.m5sim.org/r/894/#comment2161>

here



src/mem/protocol/RubySlicc_Types.sm
<http://reviews.m5sim.org/r/894/#comment2162>

Change name to evictionCallback



src/mem/ruby/system/RubyPort.cc
<http://reviews.m5sim.org/r/894/#comment2163>

ruby_eviction_callback



src/mem/ruby/system/Sequencer.hh
<http://reviews.m5sim.org/r/894/#comment2164>

Again evictionCallback


- Brad
Nilay Vaish
2011-11-18 23:36:48 UTC
Permalink
This was real quick!

--
Nilay
Nilay Vaish
2011-11-18 23:51:53 UTC
Permalink
Brad, all the comments from your review have been addressed in the latest
version of the patch.


- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1666
-----------------------------------------------------------
Nilay Vaish
2011-11-18 23:50:36 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2011-11-18 15:50:36.549933)


Review request for Default.


Summary (updated)
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
is the LSQ in case of the O3 CPU.

A couple of issues that need discussion --

* How to avoid deadlock in the O3 CPU? The CPU may not be able to proceed if a
particular cache block is repeatedly invalidated before the CPU can retire
the actual load/store instruction. How can the O3 CPU ensure that at least one
instruction is retired between successive evictions?

* How to test this implementation? Can the O3 CPU execute from a trace of
instructions? I have an implementation of the TSOTool. I think we can
implement some of the tests that we regularly come across in papers on
consistency models, and those present in manuals from AMD and Intel.


Diffs (updated)
-----

configs/example/se.py 330f8109b199
configs/ruby/MESI_CMP_directory.py 330f8109b199
configs/ruby/MI_example.py 330f8109b199
configs/ruby/MOESI_CMP_directory.py 330f8109b199
configs/ruby/MOESI_CMP_token.py 330f8109b199
configs/ruby/MOESI_hammer.py 330f8109b199
src/mem/protocol/MESI_CMP_directory-L1cache.sm 330f8109b199
src/mem/protocol/MI_example-cache.sm 330f8109b199
src/mem/protocol/MOESI_CMP_directory-L1cache.sm 330f8109b199
src/mem/protocol/MOESI_CMP_token-L1cache.sm 330f8109b199
src/mem/protocol/MOESI_hammer-cache.sm 330f8109b199
src/mem/protocol/RubySlicc_Types.sm 330f8109b199
src/mem/ruby/system/RubyPort.hh 330f8109b199
src/mem/ruby/system/RubyPort.cc 330f8109b199
src/mem/ruby/system/Sequencer.hh 330f8109b199
src/mem/ruby/system/Sequencer.cc 330f8109b199

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Gabe Black
2011-11-19 09:57:28 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1670
-----------------------------------------------------------



configs/example/se.py
<http://reviews.m5sim.org/r/894/#comment2165>

Why are you changing this? This looks really wrong.


- Gabe
Nilay Vaish
2011-11-19 15:57:56 UTC
Permalink
Post by Nilay Vaish
configs/example/se.py, line 188
<http://reviews.m5sim.org/r/894/diff/7/?file=15514#file15514line188>
Why are you changing this? This looks really wrong.
I would not be committing this. This is just for testing multithreaded applications.


- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1670
-----------------------------------------------------------
Nilay Vaish
2012-01-04 21:11:44 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2012-01-04 13:11:44.191446)


Review request for Default.


Summary (updated)
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
is the LSQ in case of the O3 CPU.


Diffs (updated)
-----

configs/ruby/MESI_CMP_directory.py 09b482ee9ae0
configs/ruby/MI_example.py 09b482ee9ae0
configs/ruby/MOESI_CMP_directory.py 09b482ee9ae0
configs/ruby/MOESI_CMP_token.py 09b482ee9ae0
configs/ruby/MOESI_hammer.py 09b482ee9ae0
src/mem/protocol/MESI_CMP_directory-L1cache.sm 09b482ee9ae0
src/mem/protocol/MI_example-cache.sm 09b482ee9ae0
src/mem/protocol/MOESI_CMP_directory-L1cache.sm 09b482ee9ae0
src/mem/protocol/MOESI_CMP_token-L1cache.sm 09b482ee9ae0
src/mem/protocol/MOESI_hammer-cache.sm 09b482ee9ae0
src/mem/protocol/RubySlicc_Types.sm 09b482ee9ae0
src/mem/ruby/system/RubyPort.hh 09b482ee9ae0
src/mem/ruby/system/RubyPort.cc 09b482ee9ae0
src/mem/ruby/system/Sequencer.hh 09b482ee9ae0
src/mem/ruby/system/Sequencer.cc 09b482ee9ae0

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Brad Beckmann
2012-01-07 00:59:56 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1862
-----------------------------------------------------------

Ship it!


Just one minor question about the packet you create. Other than that, this looks good.


src/mem/ruby/system/RubyPort.cc
<http://reviews.m5sim.org/r/894/#comment2368>

Should this use a different MemCmd than ReadExReq?


- Brad
Nilay Vaish
2012-01-07 03:12:23 UTC
Permalink
Post by Nilay Vaish
src/mem/ruby/system/RubyPort.cc, line 699
<http://reviews.m5sim.org/r/894/diff/8/?file=16993#file16993line699>
Should this use a different MemCmd than ReadExReq?
The current code in cache_impl.hh reads the command from the
packet the cache just snooped and creates a snoop packet
with the same command. This snoop packet is received by the
LSQ. In the case of a ReadExReq, the cache will send a snoop
packet to the LSQ; hence, I chose this command. I did not
find anything better in packet.cc, unless we want to define
a new command.


- Nilay


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/#review1862
-----------------------------------------------------------
Beckmann, Brad
2012-01-09 20:09:26 UTC
Permalink
If it doesn't impact the LSQ design much, I would prefer a new command so that the code is clearer. At the very least, there should be a comment better explaining what is going on and what makes the packet a snoop packet.

Brad



From: Nilay Vaish [mailto:***@cs.wisc.edu]
Sent: Friday, January 06, 2012 7:12 PM
To: Nilay Vaish; Beckmann, Brad; Default
Subject: Re: Review Request: O3, Ruby: Forward invalidations from Ruby to O3 CPU

This is an automatically generated e-mail. To reply, visit: http://reviews.m5sim.org/r/894/



On January 6th, 2012, 4:59 p.m., Brad Beckmann wrote:

src/mem/ruby/system/RubyPort.cc (Diff revision 8), line 699
<http://reviews.m5sim.org/r/894/diff/8/?file=16993#file16993line699>

    RubyPort::ruby_eviction_callback(const Address& address)
    ...
    Packet *pkt = new Packet(&req, MemCmd::ReadExReq, -1);

Should this use a different MemCmd than ReadExReq?

The current code in cache_impl.hh reads the command from the packet the
cache just snooped and creates a snoop packet with the same command. This
snoop packet is received by the LSQ. In the case of a ReadExReq, the cache
will send a snoop packet to the LSQ; hence, I chose this command. I did not
find anything better in packet.cc, unless we want to define a new command.

- Nilay


Nilay Vaish
2012-01-13 12:01:58 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2012-01-13 04:01:57.966969)


Review request for Default.


Summary
-------

O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
is the LSQ in case of the O3 CPU.


Diffs (updated)
-----

configs/ruby/MESI_CMP_directory.py ec81887756c4
configs/ruby/MI_example.py ec81887756c4
configs/ruby/MOESI_CMP_directory.py ec81887756c4
configs/ruby/MOESI_CMP_token.py ec81887756c4
configs/ruby/MOESI_hammer.py ec81887756c4
src/mem/protocol/MESI_CMP_directory-L1cache.sm ec81887756c4
src/mem/protocol/MI_example-cache.sm ec81887756c4
src/mem/protocol/MOESI_CMP_directory-L1cache.sm ec81887756c4
src/mem/protocol/MOESI_CMP_token-L1cache.sm ec81887756c4
src/mem/protocol/MOESI_hammer-cache.sm ec81887756c4
src/mem/protocol/RubySlicc_Types.sm ec81887756c4
src/mem/ruby/system/RubyPort.hh ec81887756c4
src/mem/ruby/system/RubyPort.cc ec81887756c4
src/mem/ruby/system/Sequencer.hh ec81887756c4
src/mem/ruby/system/Sequencer.cc ec81887756c4

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Nilay Vaish
2012-01-13 12:22:05 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2012-01-13 04:22:05.301816)


Review request for Default.


Summary (updated)
-------

Changeset 8700:4e8b3783af9e
---------------------------
O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
is the LSQ in case of the O3 CPU.


Diffs (updated)
-----

configs/ruby/MESI_CMP_directory.py f348cf78072c
configs/ruby/MI_example.py f348cf78072c
configs/ruby/MOESI_CMP_directory.py f348cf78072c
configs/ruby/MOESI_CMP_token.py f348cf78072c
configs/ruby/MOESI_hammer.py f348cf78072c
src/mem/protocol/MESI_CMP_directory-L1cache.sm f348cf78072c
src/mem/protocol/MI_example-cache.sm f348cf78072c
src/mem/protocol/MOESI_CMP_directory-L1cache.sm f348cf78072c
src/mem/protocol/MOESI_CMP_token-L1cache.sm f348cf78072c
src/mem/protocol/MOESI_hammer-cache.sm f348cf78072c
src/mem/protocol/RubySlicc_Types.sm f348cf78072c
src/mem/ruby/system/RubyPort.hh f348cf78072c
src/mem/ruby/system/RubyPort.cc f348cf78072c
src/mem/ruby/system/Sequencer.hh f348cf78072c
src/mem/ruby/system/Sequencer.cc f348cf78072c

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
Nilay Vaish
2012-01-13 12:22:55 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.m5sim.org/r/894/
-----------------------------------------------------------

(Updated 2012-01-13 04:22:55.155470)


Review request for Default.


Summary (updated)
-------

Changeset 8700:1c93580f459b
---------------------------
O3, Ruby: Forward invalidations from Ruby to O3 CPU
This patch implements the functionality for forwarding invalidations and
replacements from the L1 cache of the Ruby memory system to the O3 CPU. The
implementation adds a list of ports to RubyPort. Whenever a replacement or an
invalidation is performed, the L1 cache forwards this to all the ports, which
is the LSQ in case of the O3 CPU.


Diffs (updated)
-----

configs/ruby/MESI_CMP_directory.py f348cf78072c
configs/ruby/MI_example.py f348cf78072c
configs/ruby/MOESI_CMP_directory.py f348cf78072c
configs/ruby/MOESI_CMP_token.py f348cf78072c
configs/ruby/MOESI_hammer.py f348cf78072c
src/mem/protocol/MESI_CMP_directory-L1cache.sm f348cf78072c
src/mem/protocol/MI_example-cache.sm f348cf78072c
src/mem/protocol/MOESI_CMP_directory-L1cache.sm f348cf78072c
src/mem/protocol/MOESI_CMP_token-L1cache.sm f348cf78072c
src/mem/protocol/MOESI_hammer-cache.sm f348cf78072c
src/mem/protocol/RubySlicc_Types.sm f348cf78072c
src/mem/ruby/system/RubyPort.hh f348cf78072c
src/mem/ruby/system/RubyPort.cc f348cf78072c
src/mem/ruby/system/Sequencer.hh f348cf78072c
src/mem/ruby/system/Sequencer.cc f348cf78072c

Diff: http://reviews.m5sim.org/r/894/diff


Testing
-------


Thanks,

Nilay
