Post by David Hashe Post by Andreas Hansson
I see how this works as a stop gap, but ultimately I would like to push for the removal of the shadow memory as the first option. Is it really that much effort?
I'm not personally familiar enough with why the shadow memory is needed to be able to say how much effort it would take to remove, but I believe so.
Providing background since some might not be familiar with the problem.
__The following links are relevant:__
http://reviews.gem5.org/r/2466 (Joel Hestness' response to Andreas Hansson)
http://reviews.gem5.org/r/2627 (Joel Hestness' comment)
http://reviews.gem5.org/r/3580 (Andreas Hansson's comment)
https://groups.google.com/forum/#!msg/gem5-gpu-dev/hjMJs_bAwlY/tE05yRQfJysJ (Joel Hestness’ comment)
__Why does Ruby need a shadow copy?__
Ruby needs the shadow copy to allow it to do functional accesses in situations where it would normally fail. Functional accesses are generated by system calls or by devices to do functional loading and storing to hack around deficiencies in the device model or runtime.
__What is a functional access?__
A functional access is a memory access that immediately resolves in the memory system. Typically, this involves updating the data value of the memory location without generating any events that go into the event queue. The result is that the memory values appear to have been updated magically without ever creating the events that it would have needed to create if it was operating in the normal manner.
__What's different about functional accesses compared with timing accesses?__
The difference is that functional accesses must complete immediately before returning control back over to the simulation. For example, system calls are executed in an X86 system when the processor executes either 'int 0x80' or 'syscall'. In SE mode, the system call invocation and all of the resulting loads and stores must be completed by the time that we return control back to the simulated process. That single 'syscall' instruction that the processor executes is supposed to represent an entire set of instructions, many of them necessary loads and stores, that would have executed if we were running the code in a real system with an actual kernel.
Timing accesses, on the other hand, are sent through the cache hierarchy and represent what would happen in a "real" system. For timing accesses, the processor creates events that get put into the event queue and are resolved at specific ticks according to the memory model associated with the simulation. Each memory event can generate subsequent events which may or may not modify the cache state and memory state of the simulated system.
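To make the distinction concrete, here is a minimal sketch of the two access types against a toy event queue. This is not gem5 code: `ToyMemory`, `functional_write`, `timing_write` and `run_until` are hypothetical names invented for illustration, and the latency is arbitrary.

```python
import heapq


class ToyMemory:
    """Hypothetical toy model; not the gem5 memory system."""

    def __init__(self):
        self.data = {}
        self.tick = 0
        self._queue = []  # pending (tick, seq, addr, value) events
        self._seq = 0     # tie-breaker so events at the same tick stay ordered

    def functional_write(self, addr, value):
        # Resolves immediately: no event is created, no simulated time passes.
        self.data[addr] = value

    def timing_write(self, addr, value, latency):
        # Schedules an event; the value is only visible once it is processed.
        heapq.heappush(self._queue, (self.tick + latency, self._seq, addr, value))
        self._seq += 1

    def run_until(self, tick):
        # Process every event scheduled at or before `tick`.
        while self._queue and self._queue[0][0] <= tick:
            when, _, addr, value = heapq.heappop(self._queue)
            self.tick = when
            self.data[addr] = value
        self.tick = tick


mem = ToyMemory()
mem.timing_write(0x100, 1, latency=10)   # in flight until tick 10
mem.functional_write(0x200, 2)           # visible right away
visible_before = dict(mem.data)
mem.run_until(10)
visible_after = dict(mem.data)
```

Before the queue is drained only the functional write is visible; the timing write appears once simulated time reaches its tick. That gap is where the two kinds of access can be reordered relative to each other.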
__Why can't Ruby handle functional accesses without the shadow copy?__
Well, it could handle functional accesses without the shadow copy, but that is difficult to implement properly for most protocols. The shadow copy has been considered an acceptable crutch that lets protocol writers avoid the complexities associated with verifying that their protocol is data correct.
Consider the following case: a read request comes into an L1 cache and is about to evict a cache line to be sent to a downstream L2 cache. The eviction is represented by a series of state transitions in Ruby that handle moving the stale data out of the L1 into the L2, or possibly into a temporary buffer, before copying the new data into the L1 cache. Several intermediate states, termed transient states, may be needed to complete the transition. While the cache line’s state machine is in a transient state, data cannot be read from or written to the cache line. (Ruby has assertions in the code that guard against reads on such lines, since some of the data is "busy".) The asserts were added because the evicted, old data likely resides in some temporary data structure(s) which are not easy to access and update (e.g. MSHR, write buffer, message buffer, request packets, etc.). That doesn’t mean that it’s conceptually impossible to update all of this temporary data; it’s just difficult to do in most cases.
__How does the shadow copy solve the problem?__
The "--access-backing-store" option solves the problem by caching data in a shadow copy of the system memory. __All functional accesses are sent to this shadow copy instead of being directed to the normal, default system memory. Also, hit callbacks from the memory slave ports (which belong to the sequencers that created the request) will write (or read) data into (from) the shadow copy during the hit callback. If I am not mistaken, the hit callback on the memory slave port is equivalent to an L1 hit meaning that the request completed traversing the cache subsystem. In traversing the cache subsystem, the request did touch the default memory through normal behavior, but any returning information carried by the packet will be discarded in favor of what’s in the shadow copy. If data is read from the shadow copy, the request packet (again issued by a timing request) is updated to reflect the shadow copy’s value before the packet is finally handed back to the sequencer.__ The interesting code can be found by searching for "access_backing_store" in "RubyPort.cc".
System call instructions have an ordering semantic that prevents them from being executed before all of the preceding instructions have executed. The ordering semantic protects us from clobbering and/or missing timing accesses with subsequent functional accesses during the system call. The key thing which protects us here is that the Ruby sequencer needs to tell the processor that the instruction has finished. This cannot happen until the L1's hit callback has returned, ensuring that the shadow copy has seen the timing accesses. (Need to verify this by looking through that code, but believe that’s true from previous experience.)
If that’s true, then other functional accesses need to be careful in how they issue instructions or we may see consistency issues caused by value reordering from the cache hierarchy. For instance, consider what might happen if the system call did not have the ordering property. It would be possible for the system call instruction to issue functional accesses to the shadow copy before still-active timing accesses were seen by it. (There's no way that the processor could prevent the accesses from occurring by checking normal data dependencies because all it sees is a single instruction: 'syscall' or 'int 0x80'.) So, I am a bit wary of seeing functional accesses in weird places. For instance, I wouldn’t embed a functional access into a normal instruction. (I don't know if anyone has ever tried that or if it's even possible, but it would be a bad idea. There might be a magic instruction which does this or someone might try to do it in the future.)
__What happens if we do not have a shadow copy?__
The behavior without a shadow copy of memory (i.e. no --access-backing-store) is kind of interesting. It highlights why we need the shadow copy in the first place (see RubySystem.cc::functional_read/write). Essentially, the functional_writes will always succeed by attempting to write to as much of their state as possible. However, functional_reads can (and will) fail. It’s not completely obvious, but I am confident that the failures stem from the cache lines returning “busy” states caused by recent transitions in the cache hierarchy. (It seems that this is what Nilay is referring to in his summary for reviews.gem5.org/r/2466.)
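A rough sketch of that behavior, under heavy simplification: the classes below are hypothetical and only mimic the idea that a functional read cannot be served from a line in a transient ("busy") state, while a backing store always holds a readable value. The real logic lives in the Ruby code referenced above and is considerably more involved.

```python
class ToyLine:
    """Hypothetical cache line with a stable or transient state."""

    def __init__(self, value, state="stable"):
        self.value = value
        self.state = state  # "stable" or "transient"


class ToyRuby:
    """Hypothetical stand-in for a Ruby hierarchy; not gem5 code."""

    def __init__(self, use_backing_store):
        self.lines = {}
        # Assumption: the backing store is modelled as a plain dict.
        self.backing = {} if use_backing_store else None

    def install(self, addr, value, state="stable"):
        # Stands in for a completed timing access; in this toy the
        # backing store is updated here, as a hit callback would do.
        self.lines[addr] = ToyLine(value, state)
        if self.backing is not None:
            self.backing[addr] = value

    def functional_read(self, addr):
        if self.backing is not None:
            return self.backing[addr]   # always succeeds
        line = self.lines[addr]
        if line.state == "transient":
            return None                 # read fails: the line is busy
        return line.value

    def functional_write(self, addr, value):
        if self.backing is not None:
            self.backing[addr] = value  # only the shadow copy is touched
        elif addr in self.lines:
            self.lines[addr].value = value  # write whatever state we can reach
        return True


without_store = ToyRuby(use_backing_store=False)
without_store.install(0x40, 7, state="transient")
with_store = ToyRuby(use_backing_store=True)
with_store.install(0x40, 7, state="transient")

read_without = without_store.functional_read(0x40)  # fails on the busy line
read_with = with_store.functional_read(0x40)        # served by the shadow copy
```

In this toy, the functional write always reports success while the functional read returns nothing for a busy line unless the backing store is present, which mirrors the asymmetry described above.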
__Is it possible to remove the shadow copy?__
Yes, it is possible, but it requires a lot of work; more work than most people can reasonably be expected to contribute as an unrelated patch. The solution requires that the protocols are data correct; this entails making all of the functional accesses propagate correctly through temporary variables. Even if it is possible to remove for existing public protocols, it's likely that the protocol developers will want to retain this functionality to help with developing new protocols. Even if that's done, I suspect heavy resistance if we tried to force other developers with private protocols to ensure that their protocols are data correct even in the face of functional accesses. It's my understanding that the folks here at AMD aren't the only ones who rely on the shadow copy; I think Wisconsin folks use it too.
Generally, we need better random memory testers to exercise the protocols and uncover problems. In my opinion, that should be the main priority for Ruby developers. I don't have much confidence in running new workloads if the simulation relies on Ruby; the protocols just aren't tested well enough. This memory tester needs to issue functional accesses as well as timing accesses to actually test whether the protocols are always data correct. It's not enough to simply have a few benchmarks that we test in the regressions even if the benchmarks are long running.
Thanks for all the comments Brandon.
The reason why I don't like the original patch is that it confuses atomic and functional (we should really rename the latter to debug accesses to align with SystemC TLM), and does so without any sensible rules/assumptions in place. I guess the shadow memory is quite similar though, as it seems near impossible to actually explain what is correct and/or expected behaviour when timing and functional accesses interleave.
At the moment we are experiencing actual ordering (and correctness) issues with the interplay between real memory accesses, i.e. timing and atomic, and the debug memory accesses performed using the functional API. In our case the latter are used by a number of wrapped models that are expecting instantaneous updates to memory, and separate functional and timing packets. The bottom line is that the use of functional accesses causes big problems when ordering and consistency models matter, and we should try and sort this out before we make it even more complicated. We have the same issue with the SST integration. Ultimately the functional read and write accesses should only be used when the guest system is at rest, but clearly we are not sticking to that at the moment.
The MemChecker is a good start in checking that the timing behaviour works as expected with respect to the consistency model, but it is not clear to me how we could add the functional accesses as part of the check. Also, is it at all used with Ruby today? In the interplay between timing/atomic and functional, what is "right"? Technically the functional/debug accesses do not exist in the simulated guest. I am really keen to hear what people think is right.
This has been a great discussion and I would like to see it continue. However, I think we should all conclude that the discussion is somewhat orthogonal to the current patch. The current patch fixes an immediate problem using KVM with Ruby. It has gone through 4 revisions and David responded to many comments and suggestions. Can we please check it in now?
Brandon is right on regarding the tester gap. However, we should be clear on the capabilities of the current testers and we should all agree this gap is not unique to Ruby. The current memtest randomizes racy accesses, but only really checks single-writer/multiple-reader coherence. The current rubytest further stresses races, but only checks SC execution. I'm less familiar with the MemChecker, but I believe it simply monitors an execution and ensures SC execution. Please correct me if I'm wrong, but I don't believe it has any concept of an acquire/ld fence or a release/st fence. Brandon's frustration with our recent GPU protocols is due to the lack of a relaxed consistency model checker in gem5. We developed such a tester a few years ago; however, that work stopped and I have not been able to find someone to take over the development. It is hard and complicated work. There simply aren't many people who can do it. If anyone is interested, please let me know.
The MemChecker does what you want. At least that's my understanding. Stephan or Marco can comment further.
I think we should make sure we get this right, as it's already quite complex code with the various types of memories. Furthermore, once it's committed we traditionally have limited success in getting design changes done.
Andreas, even if MemChecker does what you think it does, that is orthogonal to this patch. This patch fixes a current problem with KVM and Ruby. We will likely never get rid of the backing store for Ruby, even with a perfect checker.
Do I understand your resistance to checking in this patch correctly? You don't want us to check this in and instead want us to do it "right" by spending 9-12 man-months of effort to remove the Ruby backing store? Do you fully appreciate the amount of work required? Do you understand all the benefits of the backing store as Brandon and previously Joel outlined?
I think you've got this all wrong.
The current patch only adds a visibility switch, and all I'm asking is that we document how the various switches combine, and what combinations make sense, and where they are used.
The other patch which mixes atomic/functional needs further discussion.
I'm happy to be wrong here and I appreciate the constructive comments since my previous comment last week. I was pretty frustrated that the discussion was degenerating to a complaint about functional access in Ruby and questioning whether the backing store was required at all. I'm glad we moved on. Hopefully David will respond to the latest set of comments soon and we can get this patch checked in.
By the way, I looked a bit more at MemChecker. It seems to ensure that LLSC operations read permissible values, and that non-racy reads and writes are SC. It does not appear to handle the release and acquire fences/RMWs that exist in our GPU memory model. More importantly, it is a monitor. It does not initiate memory accesses. I suspect MemChecker is only as good as the tests running on top of it. In contrast, memtest and rubytest are self-contained tests and checkers. There looks to be a lot of valuable code in MemChecker, especially in MemChecker::ByteTracker::inExpectedData(), that may be leveraged in a self-contained release consistency tester/checker. I would be interested to hear Stephan's or Marco's thoughts on that.
Post by David Hashe
(Updated Aug. 5, 2016, 9:37 p.m.)
Review request for Default.
cpu, mem, sim: Enable KVM support for Ruby
Only map memories into the KVM guest address space that are
marked as usable by KVM.
Remember whether a BackingStoreEntry should be mapped by KVM.
Fix bug causing incomplete draining of Ruby Sequencer.