Let’s move this conversation to just the email thread.
I suspect we may be talking past each other, so let’s talk about the complete implementation, not just Ruby. There are multiple ways one can implement the store portion of x86-TSO. I’m not sure what the O3 model does, but here are a few possibilities:
- Do not issue any part of the store to the memory system when the instruction is executed. Instead, simply buffer it in the LSQ until the instruction retires, then buffer it in the store buffer after retirement. Only when the store reaches the head of the store buffer, issue it to Ruby. The next store is not issued to Ruby until the previous head store completes, maintaining correct store ordering.
- Do not issue any part of the store to the memory system when the instruction is executed. Instead, simply buffer it in the LSQ until the instruction retires. Once it retires and enters the store buffer, issue the address request to Ruby (no L1 data update). Ruby forwards probes/replacements to the store buffer, and if the store buffer sees a probe/replacement to an address whose address request has already completed, the store buffer reissues the request. Once the store reaches the head of the store buffer, double check with Ruby that write permissions still exist in the L1.
- Issue the store address (no L1 data update) to Ruby when the instruction is executed. When it retires, it enters the store buffer. Ruby forwards probes/replacements to the LSQ+store buffer, and if either sees a probe/replacement to an address whose address request has already completed, the request is reissued (several policies exist on when to reissue the request). Once the store reaches the head of the store buffer, double check with Ruby that write permissions still exist in the L1.
Do those scenarios make sense to you? I believe we can implement any one of them without modifying Ruby’s core functionality. If you are envisioning or if O3 implements something completely different, please let me know.
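To make the first option concrete, here is a minimal sketch of a store buffer that keeps only one store outstanding at a time. It is a toy model, not O3 or Ruby code: the class name, the `issue_to_memory` callback, and the `on_complete` hook are all hypothetical stand-ins for whatever interface the LSQ would actually use.

```python
from collections import deque


class InOrderStoreBuffer:
    """Toy model of the first option: one store outstanding at a time."""

    def __init__(self, issue_to_memory):
        # issue_to_memory is a hypothetical callback standing in for the
        # request the core would send to Ruby; it is not a gem5 API.
        self.issue_to_memory = issue_to_memory
        self.pending = deque()   # retired stores, oldest first
        self.in_flight = False   # True while the head store is outstanding

    def retire(self, addr, data):
        """Called when a store instruction retires from the LSQ."""
        self.pending.append((addr, data))
        self._try_issue()

    def on_complete(self):
        """Called when the memory system acknowledges the head store."""
        self.pending.popleft()
        self.in_flight = False
        self._try_issue()

    def _try_issue(self):
        # Only the head of the buffer is ever issued, and only when no
        # earlier store is still outstanding.
        if self.pending and not self.in_flight:
            self.in_flight = True
            self.issue_to_memory(*self.pending[0])


issued = []
sb = InOrderStoreBuffer(lambda addr, data: issued.append((addr, data)))
sb.retire(0x40, 1)
sb.retire(0x80, 2)   # buffered behind the first store, not yet issued
sb.on_complete()     # the ack for 0x40 releases 0x80
```

The cost of this option is visible in the sketch: the second store cannot even begin acquiring write permission until the first has completed, which is what the other two options try to avoid.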
From: Brad Beckmann [mailto:***@amd.com]
Sent: Friday, October 28, 2011 3:01 PM
To: Nilay Vaish; Beckmann, Brad; Default
Subject: Re: Review Request: Forward invalidations from Ruby to O3 CPU
This is an automatically generated e-mail. To reply, visit: http://reviews.m5sim.org/r/894/
On October 27th, 2011, 10:35 p.m., Brad Beckmann wrote:
Thanks for the heads up on this patch. I'm glad you found the time to dive into it.
I'm confused that the comment mentions a "list of ports", but I don't see a list of ports in the code and I'm not sure how it would even be used.
The two questions you pose are good ones. Hopefully someone who understands the O3 LSQ can answer the first, and I would suggest creating a new directed test that can manipulate the enqueue latency on the mandatory queue to create the necessary test situations.
Also, I have a couple high-level comments right now:
- Ruby doesn't implement any particular memory model. It just implements the cache coherence protocol, and more specifically invalidation based protocols. The protocol, in combination with the core model, results in the memory model.
- I don't think it is sufficient to just forward those probes that hit valid copies to the O3 model. What about replacements of blocks that have serviced a speculative load? Instead, my thought would be to forward all probes to the O3 LSQ and think of CPU-controlled policies to filter out unnecessary probes.
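One possible CPU-side filtering policy, sketched under assumptions: the LSQ remembers which cache block each not-yet-retired load read from, and a forwarded probe or replacement only matters if it hits one of those blocks. The 64-byte block size, the class, and its method names are hypothetical and do not correspond to the O3 LSQ interface.

```python
BLOCK_BYTES = 64  # assumed cache block size


def block_of(addr):
    """Return the block-aligned address containing addr."""
    return addr & ~(BLOCK_BYTES - 1)


class SpeculativeLoadFilter:
    """Tracks executed-but-unretired loads by the block they read."""

    def __init__(self):
        self.spec_loads = {}  # sequence number -> block address

    def load_executed(self, seq_num, addr):
        self.spec_loads[seq_num] = block_of(addr)

    def load_retired(self, seq_num):
        # A retired load is no longer speculative, so probes can ignore it.
        self.spec_loads.pop(seq_num, None)

    def on_probe(self, addr):
        """Return sequence numbers of loads that must be squashed.

        An empty list means the probe (or replacement) can be dropped.
        """
        blk = block_of(addr)
        return sorted(s for s, b in self.spec_loads.items() if b == blk)


f = SpeculativeLoadFilter()
f.load_executed(1, 0x44)    # falls in block 0x40
f.load_executed(2, 0x100)   # different block
hits_before = f.on_probe(0x40)
f.load_retired(1)
hits_after = f.on_probe(0x40)
```

With every probe forwarded, Ruby needs no knowledge of which blocks serviced speculative loads; that bookkeeping stays in the core.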
On October 28th, 2011, 3:32 a.m., Nilay Vaish wrote:
Hi Brad, thanks for the response.
* A list of ports has been added to RubyPort.hh, the ports are added
to the list whenever a new M5Port is created.
* As long as the core waits for an ack from the memory system for every store
before issuing the next one, I can understand that memory model is independent
of how the memory system is implemented. But suppose the caches are multi-ported.
Then will the core only use one of the ports for stores and wait for an ack?
The current LSQ implementation uses as many ports as available. In this case,
would not the memory system need to ensure the order in which the stores are
* I think the current implementation handles blocks for which coherence
permissions were speculatively fetched. If the cache loses permissions on this
block, then it will forward the probe to the CPU. If the cache again receives
a probe for this block, I don't think that the CPU will have any instruction
using the value from that block.
* For testing, Prof. Wood suggested having something similar to TSOtool.
On October 28th, 2011, 9:55 a.m., Brad Beckmann wrote:
Hmm...I'm now even more confused. I have not looked at the O3 LSQ, but it sounds like from your description that one particular instantiation of the LSQ will use N ports, not just a single port to the L1D. So does N equal the number of simultaneous loads and stores that can be issued per cycle, or is N equal to the number of outstanding loads and stores supported by the LSQ? Or does it equal something completely different?
Stores to different cache blocks can be issued to the memory system out-of-order and in parallel. Ruby already supports such functionality. The key is the store buffer must be drained in-order. It is up to the store buffer's functionality to get that right. Ruby can assist by providing interfaces for checking permission state and forwarding probes upstream, but it is up to the LSQ/store buffer to act appropriately and retry requests when necessary. I don't believe Ruby needs any fundamental changes to support x86-TSO. Instead, Ruby just needs to provide more information back to the LSQ.
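As a rough illustration of that split in responsibility, the sketch below lets write permissions arrive in any order while data is only written from the head of the buffer. It is a toy model under stated assumptions: addresses are treated as block addresses, and `on_permission`, `on_probe`, and `drain` are hypothetical hooks rather than existing Ruby or O3 interfaces.

```python
from collections import deque


class PrefetchingStoreBuffer:
    """Permissions may arrive out of order; data drains strictly in order."""

    def __init__(self):
        self.entries = deque()  # retired stores, oldest first
        self.memory = {}        # stands in for globally visible L1 data

    def retire(self, addr, data):
        # In a real design the permission request would be issued here.
        self.entries.append({"addr": addr, "data": data, "has_perm": False})

    def on_permission(self, addr):
        """The memory system granted write permission for addr."""
        for e in self.entries:
            if e["addr"] == addr:
                e["has_perm"] = True

    def on_probe(self, addr):
        """A forwarded probe/replacement revokes permission for addr.

        Returns True if some buffered store must reissue its request.
        """
        reissue = False
        for e in self.entries:
            if e["addr"] == addr and e["has_perm"]:
                e["has_perm"] = False
                reissue = True
        return reissue

    def drain(self):
        """Write stores from the head while the head holds permission."""
        drained = []
        while self.entries and self.entries[0]["has_perm"]:
            e = self.entries.popleft()
            self.memory[e["addr"]] = e["data"]
            drained.append(e["addr"])
        return drained


sb = PrefetchingStoreBuffer()
sb.retire(0x40, 1)
sb.retire(0x80, 2)
sb.on_permission(0x80)     # the younger store's permission arrives first
first = sb.drain()         # nothing drains: the head (0x40) has no permission
sb.on_permission(0x40)
lost = sb.on_probe(0x80)   # the younger store loses permission again
second = sb.drain()        # only 0x40 becomes visible
sb.on_permission(0x80)     # reissued request completes
third = sb.drain()
```

Checking `has_perm` at the head plays the role of the final "double check with Ruby" step: a store never writes data unless permission is still held when it reaches the head.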
Earlier I didn't notice that you also squash speculation on replacements, in addition to probes. Yeah, I think those changes take care of correctly squashing speculative loads. However, as I mentioned above, I still think we need to figure out how to provide the necessary information to allow stores to be issued in parallel, while still retiring in-order.
Implementing something similar to TSOtool would be great. However, I think there is benefit to do some quick tests using a DirectedTester before creating something like TSOtool.
On October 28th, 2011, 2:13 p.m., Nilay Vaish wrote:
My understanding is that the LSQ can issue at most N loads and stores to
the memory system in each cycle.
For parallel stores, it seems that the core should have permissions for
these cache blocks all at the same time. Even if Ruby fetches coherence
permissions out-of-order, it would still have to ensure, for SC or TSO,
that stores that happened logically later in time become visible only
after all the earlier ones are visible to rest of the system. As of now,
I disagree with the statement that --
'' Stores to different cache blocks can be issued to the
memory system out-of-order and in parallel ''
Unless we have some kind of guarantee on the order in which these stores
become visible to the rest of the system, I don't see how we can separate
out the memory system's behavior from the consistency model.
I was thinking of writing a tester that reads in a trace of memory operations
performed by a multi-processor system and the times at which these are performed.
Then we can check the load values against the expected load values. I think the
underlying assumption is that everything behaves in a deterministic fashion. What
do you think?
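A minimal sketch of the checking half of such a tester, assuming each trace record carries the tick at which the operation was performed and that operations take effect atomically in tick order. The record layout and the zero-initialized memory are assumptions made for illustration only.

```python
def check_loads(trace):
    """Replay a trace in tick order and report loads with unexpected values.

    Each record is (tick, cpu, kind, addr, value), where kind is "ST" or
    "LD". For a load, value is what the load observed in simulation.
    """
    memory = {}
    mismatches = []
    for tick, cpu, kind, addr, value in sorted(trace, key=lambda r: r[0]):
        if kind == "ST":
            memory[addr] = value
        else:
            expected = memory.get(addr, 0)  # memory assumed to start at zero
            if value != expected:
                mismatches.append((tick, cpu, addr, expected, value))
    return mismatches


trace = [
    (10, 0, "ST", 0x40, 7),
    (20, 1, "LD", 0x40, 7),   # observes the store, as expected
    (30, 1, "LD", 0x80, 5),   # nothing wrote 0x80, so 5 is unexpected
]
bad = check_loads(trace)
```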
Thanks for confirming the O3 LSQ requirement for N ports. I've got no further questions on that.
Stores can certainly be issued out-of-order in modern x86 processors. It is the store buffer's responsibility to ensure that stores become globally visible in program order. Maybe what you're getting at is that Ruby needs to support a two-phase store scheme so that the initial writeHitCallback supplies data to the CPU but does not update the L1 D cache block. I would agree to that. My point is that Ruby should only be responsible to provide the necessary information and interfaces to the LSQ logic. There is no reason to change the logic of Ruby's invalidation-based coherence protocols. It is the LSQ's (including store buffer) responsibility to ensure the correct order of store visibility.
Yes, your tester idea is essentially what I had in mind. The only thing I want to point out is that it may be beneficial to include both the time the request should issue and a delta of how long the request should be stalled in the mandatory queue. That way you can instigate races where younger memory ops deterministically bypass older ops.
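For example, a trace record could look something like the sketch below. The field names and the `arrival_order` helper are hypothetical; they only illustrate how an issue time plus a stall delta makes the bypass deterministic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceOp:
    cpu: int
    kind: str         # "LD" or "ST"
    addr: int
    issue_tick: int   # when the tester issues the request
    stall_ticks: int  # extra delay imposed in the mandatory queue


def arrival_order(ops):
    """Order in which requests leave the mandatory queue.

    sorted() is stable, so ops with equal arrival ticks keep trace order.
    """
    return sorted(ops, key=lambda op: op.issue_tick + op.stall_ticks)


trace = [
    TraceOp(cpu=0, kind="ST", addr=0x40, issue_tick=10, stall_ticks=50),
    TraceOp(cpu=0, kind="ST", addr=0x80, issue_tick=11, stall_ticks=0),
]
order = arrival_order(trace)  # the younger store to 0x80 arrives first
```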
On October 17th, 2011, 11:50 p.m., Nilay Vaish wrote:
Review request for Default.
By Nilay Vaish.
Updated 2011-10-17 23:50:47
This patch implements the functionality for forwarding invalidations
and replacements from the L1 cache of the Ruby memory system to the O3
CPU. The implementation adds a list of ports to RubyPort. Whenever a replacement
or an invalidation is performed, the L1 cache forwards this to all the ports,
which I believe is the LSQ in case of the O3 CPU. Those who understand the O3
LSQ should take a close look at the implementation and figure out (at least
qualitatively) if something is missing or erroneous.
This patch only modifies the MESI CMP directory protocol. I will modify other
protocols once we sort the major issues surrounding this patch.
My understanding is that this should ensure an SC execution, as
long as Ruby can support SC. But I think Ruby does not support any
memory model currently. A couple of issues that need discussion --
* Can this get in to a deadlock? A CPU may not be able to proceed if
a particular cache block is repeatedly invalidated before the CPU
can retire the actual load/store instruction. How do we ensure that
at least one instruction is retired before an invalidation/replacement
* How to test this implementation? Is it possible to implement some of the
tests that we regularly come across in papers on consistency models? Or
those present in manuals from AMD and Intel? I have tested that Ruby will
forward the invalidations, but not the part where the LSQ needs to act on
* build_opts/ALPHA_SE_MESI_CMP_directory (92ba80d63abc)
* configs/example/se.py (92ba80d63abc)
* configs/ruby/MESI_CMP_directory.py (92ba80d63abc)
* src/mem/protocol/MESI_CMP_directory-L1cache.sm (92ba80d63abc)
* src/mem/protocol/RubySlicc_Types.sm (92ba80d63abc)
* src/mem/ruby/system/RubyPort.hh (92ba80d63abc)
* src/mem/ruby/system/RubyPort.cc (92ba80d63abc)
* src/mem/ruby/system/Sequencer.hh (92ba80d63abc)
* src/mem/ruby/system/Sequencer.cc (92ba80d63abc)