Post by Brad Beckmann
src/mem/packet_queue.cc, line 117
I strongly disagree with this change. This buffer should not exist, and even 100 packets is a stretch. Any module hitting this needs to recondiser how it deals with flow control
I would agree with you that the buffer should not exist, but removing the buffer goes well beyond this patch. The worst part about the buffer is the size is not visible to the sender. It is a really bad design, but for now we need to get around the immediate problem. Later we can discuss how and who will fix this right.
We've discussed the proposal in the past to increase the size, but unfortunately we have not came to a resolution. We absolutely need to resolve this now. We cannot use the ruby tester with our upcoming GPU model without this change.
The important thing to keep in mind is that the RubyTester is designed to stress the protocol logic and it creates contention scenarios that would not exist in a real system. The RubyTester finds bugs in the matter of seconds that may not be encountered in hours of real workload simulation. It does this by sending large bursts of racey requests. Bottomline: the RubyTester needs this large value.
I would argue that if you need flow control, you should not use the QueuedPorts, and rather use a "bare-metal" MasterPort, and deal with the flow control. There is no point in adding flow control (or buffer visibility) to the QueuePort in my opinion.
I'd suggest to switch to a MasterPort either as part of this patch, or a patch before it. That way you have no implicit buffer, and no need to create kilobytes of invisible buffering in the system.
The RubyPort and the Ruby Tester predate MasterPorts and QueuedPorts. Your suggestion goes far beyond this patch. Reimplementing RubyPort's m5 ports to inherit from a different base port is a very signficant change with many, many implications beyond the Ruby Tester. That is a pretty unreasonable request.
Please keep in mind that I was not the one who decided RubyPort to use QueuedPorts. That decision was made back in 2012 with patches from your group 8914:8c3bd7bea667, and 8922:17f037ad8918.
Is the queued port really supposed to be a high-fidelity model of anything in the "real" world? I was under the impression that it was a courtesy wrapper around the port so that we don't have 10 different implementations of the same general thing.
IMO, if you are trying to model an on-chip network at high-fidelity you need something more than what is provided by this model anyway. It seems to me that in this case, the queued port is a gem5 interface decision, not a hardware system design decision. Therefore, infinite buffering is not a major problem.
What if this was turned into a "warn" or a "warn_once"? It seems like a good idea to let the user know that the queue is growing wildly as that might be an indication of a deadlock.
(Note: IIRC with our high-bandwidth GPU model we've run into this 100 packet limit as well.)
1) The 100 packet sanity check should definitely not be changed as part of this patch. If we do want to change it, it should be a patch on its own.
2) If there are somehow 10000 responses queued up, then something is truly _fundamentally_ broken, and having a panic telling the user it is broken is helpful imho. Do you not agree?
The QueuedPort is an evolution of the SimpleTimingPort that was used earlier, and I believe the RubyPort has always had this issue (it borrowed a lot of not-so-great design decisions from the Bus for example). The QueuedPort should really only be used as a "lazy" way of implementing non-performance critical devices. I've been meaning to remove it from the classic Caches as well, just not gotten around to it.
In the case of the RubyPort, I do not see the big issue of moving away from QueuedPorts. We have already done so in the Bridge, SimpleMemory etc. The only implication is that we need to deal with the port flow control (retry), and pass that downstream/upstream. Since the RubyPort really does not correspond to any real physical object, with any buffers, I think the case is clear: It should not do any buffering behind the user's back. Do you not think this is reasonable?
Yes, if there is 10000 responses queued up, then there might be something broken. However, 100 responses can be queued up rather easily when running with the Ruby Tester or with a GPU. That is why this patch changes the sanity check from 100 to 10000.
RubyPort is used by a lot of users, with many different scenarios/protocols. Moving it away from QueuedPorts will fundamentally change its behavior. That should be done carefully and I don't want to delay this bug fix with a project that large.
Would you be happy if we introduced a patch that modifies the QueuedSlavePort, RespPacketQueue, and PacketQueue constructors so that the RubyPort can set the transmitList threshold to 10000 and the default can still be set to 100?
Can we simply leave this change out of this patch? I think the right solution here is to fix the dodgy assumptions on implicit buffering, and I'm happy to have a go myself (given that I've done something dimilar to a range of other modules).
The RubyTester and our soon-to-be-posed HSA-compatible GPU will not work without this change. I believe many Ruby users, especially Ruby users evaluating GPUs, have had this type of change in their personal patch queues for a very long time. We need to get this problem out of our personal patch queues and into the main line.
I appreciate your offer to redesign the RubyPort yourself, but even if that were to happen, I would prefer to get this change happen first. A redesign of the RubyPort will require a lot of downstream work to verify things still work as expected.
Why are you so against changing this somewhat arbritrary value, or at least allowing someone to change it? As Jason pointed out, the queue port is an infinite buffer and doesn't represent a real hardware structure. Why so much resistance? This small change really benefits Ruby users and if we make it configurable, it will have no impact on Classic users.
I fundamentally object to the change since it effectively adds an implicit infinite buffer (in quite some places), thus changing the very behaviour of the system and potentially masking problems.
The RubyPort should really just be a glue-logic adaptor. Agreed? In a system architecture, there is no such thing as a RubyPort, and there should not be kilobytes of "magic" buffering imposed on the user. If this buffer is needed, then we should instantiate a buffer, cache, etc. I do not understand the statement that this is needed to make things work. If you need an infinite buffer to make things work, then the solution is broken. Am I missing something?
I would propose to simply remove the QueuedPort, make it inherit from the Master/SlavePort directly, remove the implicit buffer altogether, and simply relay the retry to/from Ruby. Surely Ruby understands flow control...does it not?
Getting rid of implicit infinite buffering is a noble and appropriate long-term goal, I agree. However, in the near term, we already have implicit infinite buffering, and the only thing we're debating here is at what arbitrary point do we warn the user that the use of that infinite buffer is potentially getting abusive. To me it seems obvious that the fundamental near-term problem is that we have this totally arbitrary constant embedded in the code and no way to adjust it based on different circumstances in which different values would be appropriate. I think Brad's suggestion that we make this threshold configurable and then find some way to propagate the value downward seems like the right way to go for now; we can leave the threshold at 100 for CPU-based systems, and crank it up for GPU-based systems, with maybe a third even higher thershold specifically for when we run RubyTester.
I'd rather have one arbitrary constant than three :-), and the lower the better.
Can we just proceed with this patch without this change? At least then I can perhaps get an appreciation for the issue (if you can point me to an example that fails). Right now I really do not understand how this could ever be a problem in any sensible setup.
I agree with Andreas on this one. I don't feel that an "arbitrary" constant should be changed to another "arbitrary" constant in a patch that is trying to fix up the RubyTester. There must be configurations of the RubyTester that will work without increasing the threshold to 10000, so the RubyTester changes will still work for some configurations. These are largely orthogonal issues.
Split the threshold change into a separate patch where we can focus on debating the merits of changes between so-called "arbitrary" buffering limits (e.g. gem5-gpu doesn't require any changes to the RubyPort or QueuedPort buffering thresholds).
We can certainly split this change into a separate patch. It's true that 10,000 is just another arbitrary value; it may make more sense to just add a parameter that lets us turn off this threshold check altogether rather than create another arbitrary constant. Note that the RubyTester is not trying to be a "sensible setup", it's a stress tester that is stressing the protocol implementation beyond the limits of what a "sensible" configuration might be able to do.
I have no problem moving this change to a separate patch, but we still need to provide flexibility on either setting this arbitrary value or turning it off. Leaving it as is will not work for us. We have important configurations of the RubyTester and other internal testers that reach the 100 packet threshold.
I don't want to create a separate patch only to spawn off another week of high-level resistance. Can you confirm that if we create a patch that allows an object to configure the threshold or turn it off, that you will approve it? Do you have a preference between those two options?
The RubyTester (or other testers) is certainly an exceptional case for coherence and consistency incantations aimed at finding bugs. From what I've read about the incoming GPU memory hierarchies in these reviews, however, I'm unsure whether I can be convinced that a GPU core's sequencer should be allowed to queue more than order(100) packets. Hitting this limit with real system components seems like a major red flag that your simulated system is very unbalanced and needs to be reconfigured. We definitely need to see the GPU code to understand that more deeply.
As a separate review request, I would prefer that we add an option to allow very large but finite buffering in all PacketQueue instances of a single simulation. Allowing infinite buffering is likely a bad idea for debugging (imagine a Ruby livelock that just fills a RubyPort response buffer that never unblocks). I'd also prefer comments on that option that strongly suggest it should only be used for exceptional cases, such as testers. Such an option should *not* be exposed on the command line due to obvious potential misuse by a naive user, but instead should be set by the RubyTester simulation config file. I would likely oppose a per-instance PacketQueue buffering limit, because that would just encourage strange, improper use of fake buffering. Until we can actually see your GPU code, I feel that the option should only be allowed to be used with the RubyTester in mainline gem5 code.
Thanks Joel for voicing your preference for the finite buffering case. Once Andreas approves as well, I'll implement and post that patch.
As far as your concern with an unbalanced GPU system, please keep in mind this buffer is for responses. Also in our implementation each work-item issues a separate packet request that is coalesced in the RubyPort. All it takes is 2 wavefronts (2x64 work-items) to be waiting on the same few cache lines. When those cache lines arrive at the GPU L1 cache, 128 packets will be sent back to the GPU cache thus the threshold will be exceeded. I do not think this is a result of an unbalanced system.
I would like to understand the architecture a bit better before simply increasing the limit. Again, I propose we get this patch pushed without this change, since I envision that the discussion has not quite converged.
From the brief description it sounds like the issue arises due to a rather unfortunate use of packets. Perhaps I am missing something, but why not simply add masking or do the coalescing before the "real" port? Are the GPU L1 caches not taking care of this already (or are the L1's in Ruby?).
I had a look at the RubyPort, and it seems all cases of QueuedPort can easily be removed and replaced simply by forwarding retries from one side to the other. The one exception I cannot figure out is the "hit_callback". How can we tell Ruby that a response needs to wait?
Yes this discussion has not quite converge, but I think that is mostly due to your concerns. You say "rather unfortunate use of packets". Perhaps we used packets differently than what you would have expected, but do not believe that deems the use improper or unfortunate. Please be open to uses of gem5 beyond the ones you have previously considered.
The GPU L1 caches are modeled in Ruby. Memory requests are sequenced and coalesced in the RubyPort. This is no different than the other Ruby protocols.
The "hit_callback" is pretty fundamental to Ruby. It has been designed to always "sink" responses because the protocols are set up to reserve the resources ahead of time. I believe that is fairly representative of real hardware and please note resources like register ports are outside of the realm of Ruby. Thus responses will stall inside the RubyPort.
Please let's not stall this necessary change on a fundamental redesign of Ruby.
I think you misunderstood my comment. I think the patch should go ahead...just without the change to the queue limit. No further changes needed at this point.
I used the word "unfortunate" since the packet is a pretty heavy-weight data structure for performing byte strobes. It seems to me we should be able to come up with a much better way to solve this issue (if I have understood what it is you are trying to do). Again, I only see part of the overall picture of your use-case, but we have had related cases internally, and I think we can avoid the overheads of having a packet per byte. Makes sense?
It is fine to assume that reponses sink (eventually) and for deadlock reasons it makes sense to reserve space in advance. However, it is still important to model throughput in the ports, caches and interconnect. For this reason I do not find it too unreasonable to be able to rate regulate responses (which we do in quite a lot of places). If Ruby is unable to do this then I agree that we need an infinite buffer somewhere in the interaction with the "classic" MemObjects.
Again, please push the patch without the queue limit change, and we can revisit this point when we have a better picture of what the options are. Is that ok?
My first instinct is that it's a very bad idea to do GPU coalescing in the RubyPort. The RubyPort is a thin shim that already does too many things (and poorly in a couple cases). However, without seeing the GPU code, I expect it would be hard for you to communicate the constraints on where to do coalescing (e.g. checkpointing, address translation, etc.).
Another aspect to this buffering problem is that you're changing the L1/L0 and Sequencer clock domains back to Ruby's clock. By default, Ruby's clock is 2GHz and most GPUs have lower frequency than this. If the GPU's eject width is, say, 1 packet per lane per GPU cycle, of course you're going to pile up packets within the RubyPort on the response path. This sounds unbalanced to me, especially on a response path.
@Andreas: The hit_callback is problematic: All current coherence protocols call into their RubyPort/Sequencer from their top-level cache controller (L0 or L1) when a memory access is complete. This is a direct call into RubyPort rather than handing the accesses through a port-like interface. Because it is done this way, it is assumed that the RubyPort will not stall these calls. IMO, there should be a response buffer in each L0/L1 cache controller that assesses the L0/L1 latency, and the RubyPort should consume from it like all other Ruby components. However, this would be a lot of work to change.
Without changing the queue threshold or providing a feature that allows one to at least adjust or disable it, we are not resolving the problem, we are just delaying it.
The RubyPort is not a thin shim. I do not believe the Sequencer has ever been a thin shim, even in the GEMS days. It has always been a key component to handling synchronization and gluing cache block memory requests to instruction granular operations. I do not agree that GPU coalescing in the RubyPort is a "very bad idea". If we were to implement it in the other way, we would be arguing over a lot of extensive changes to packet and request. Furthermore, we would not be able to implement per-protocol synchronization mechanisms without having to customize the GPU core for every protocol.
This behavior has very little to do with the ratio of frequencies. If you are concerned with passing the Ruby clock frequency to the L1 caches, would you be happy if I added a L1 cache frequency parameter to create_system, a separate L1 freq value to ruby_system, or added logic that checks for how many cpus are created before trying to reference the cpu's clk domain value? I'm happy to do any of those options. I'm just trying to fix the current bug when more than CPU is specified when using the RubyTester.
The RubyPort/Sequencer *absolutely is* a thin shim. It is very clearly structured as just a set of ports that communicate (i.e. translates messages) between compute cores and Ruby controllers. Its three primary responsibilities are:
1) Translate gem5 packetized requests to Ruby's internal message structure and transmit them to L1 controllers
2) Block accesses as appropriate for coherence/synchronization (e.g. currently, multiple requests to a single line, and LL/SC)
3) Translate messages (e.g. evicts) and responses from Ruby controllers back to gem5 packets for the cores
Outside of these three things, the RubyPort only does simulation management (e.g. stats handling, checkpointing Ruby caches). In other words, it's just the interface between cores and Ruby caches. Further, it's not clear that the RubyPort activities exist in real systems, or if they do, there isn't a special component for them. Coalescers, on the other hand, are very real components of GPU hardware, so on the face, it sounds like improper organization to do coalescing in the RubyPort.
Now, I hesitated raising this concern previously, because the debate about RubyPort responsibility is way outside the scope of this patch. For instance, I can't make any sense of the claims you make about implementing coalescing another way, because I don't even know how you've implemented it in the first place. We *MUST* see GPU code before we can have a legitimate debate about appropriate responsibilities of the RubyPort when GPUs exist in gem5.
Back to the buffering issue: It's quite clear in the code that the RubyPort (and Sequencer) does not adequately account for frequency differences between itself and the requesting core. The Sequencer decrements outstanding requests as soon as the hit callback is called, but then queues the responses in the Sequencer::MemSlavePort. This means that the requesting core can issue further accesses even if there isn't adequate buffering for their responses; The RubyPort won't block on too many outstanding accesses (i.e. not enough reserved response buffers). The Sequencer frequency change almost certainly changes the amount of required buffering in the Sequencer::MemSlavePort.
For setting the Sequencer/L0/L1 frequencies, I might be ok with logic in the protocol config files that checks whether you're simulating with a tester vs. CPU/GPU cores. My only strong opinion is that when simulating a real system, L1 caches be clocked at core frequency rather than uncore (Ruby) frequency (i.e. like basically all existing hardware). However, Nilay seemed to have strong feelings on how to do this, and I have not reviewed that full conversation. I'd prefer to hear his opinion.
This is an automatically generated e-mail. To reply, visit:
Post by Brad Beckmann
(Updated May 11, 2015, 10:28 p.m.)
Review request for Default.
ruby: cleaner ruby tester support
This patch allows the ruby random tester to use ruby ports that may only
support instr or data requests. This patch is similar to a previous changeset
(8932:1b2c17565ac8) that was unfortunately broken by subsequent changesets.
This current patch implements the support in a more straight-forward way.
The patch also includes better DPRINTFs and generalizes the retry behavior
needed by the ruby tester so that other testers/cpu models can use it as well.