Post by Andreas Hansson Post by Tony Gutierrez
Is there anything holding this up from being shipped?
1) Some thought around the design. I am still hoping there is a less invasive way of accommodating the functionality, possibly with some changes to the existing functions of the cache.
2) A way to test it. This should preferably include both synthetic and real use-cases. The memtester and memchecker may be a good starting point for the synthetic part.
It would be great if someone could dig into these issues.
It's been quite a while since I've looked at this... in fact I was going to say that there might be a memory leak, but looking back at the patch history I see I fixed that already in May 2015 :).
It's pretty easy to come up with a simple test case, just using C++11 atomic ops or pthread mutexes. Unfortunately the ones I had were on my AMD work system that I had to give back, and it looks like I missed them when I went to copy off useful non-proprietary stuff before I left. I know I didn't spend much time writing tests though, so they should be easy to recreate.
As far as being less invasive, I think my reply to Andreas's comment of May 7, 2015 is still relevant: given the current structure of the cache code, I think the existing solution is really not that invasive. There isn't really that much code being added here; most of it is comments. Generating a synthetic request to allow allocating an MSHR so that we can accumulate the deferred operations that happen while the lock is held is a little strange, but it allows us to reuse the bulk of the code in recvTimingResp() with just a few changes.
If the cache code were heavily restructured, we could probably layer it in such a way that the fake packets would not be necessary, and thus the invasiveness of the locked RMW support could be reduced. However, I started down this path a long time ago (shortly after his comment), and after putting a fair amount of work into the restructuring, I still didn't even have the code working at all, so I gave up. Thus I think the net invasiveness of restructuring the cache to reduce the amount of change required specifically for this feature is much much larger than the invasiveness of this patch by itself. The end result would probably be better code, but it's a huge amount of work that I don't see anyone signing up for right now, so I think it's better to put this change in as is for now, and put "restructuring the cache" on the to-do list.
I agree, the implementation of locked instructions wouldn't be trivial with the current structure of the cache. It would be nice if we didn't have to abuse the MSHR in that way, it makes it hard to reason about memory leaks, and it can't solve the problem for atomic and functional accesses.
Wouldn't it be easier to implement this in the same way as far atomics? If I understand correctly the semantics of locked instructions, it would be significantly simpler if we could perform these operations with a single RMW packet rather than two separate, a read and then a write. Wouldn't the AtomicOpFunctor in the packet class be sufficient?
Tony would that work for you?
Thanks for the input, Nikos. While it would be a lot simpler to do the atomic ops in the cache using functors, x86 lets you put a LOCK prefix in front of a fairly long list of instructions, and then you have to combine that with all the different data sizes that each instruction supports. Plus it would be a pretty bi disruption to the whole CPU ISA definition to create and plumb those packets through to the memory interface, and it would make the locked instructions copmletely different from their unlocked counterparts. These are all the reasons we didn't do it this way to begin with, though in retrospect it's not as infeasible as perhaps it seemed up front; just complicated and a little ugly. Not necessarily worse than the current patch.
Actually though there is no problem with atomic and functional accesses... it's easy to make them atomic, as you just issue the read and the write within the same event tick/process() method (i.e., without going back to the event queue). It's only timing accesses that are hard with this interface.
I thought that the AtomicOpFunctor solves exactly this issue. While decoding the instruction you pass a function pointer which operates on the data in the memory system. Does the memory system need to be aware of the specifics of the operation? If I understand correctly the cache will only call the function pointer and it won't need to decode the instruction or know anything about it for that matter. I am not really familiar with the LOCK prefix but do we need to support something more than arithmetic operations?
It would be great if this was not necessarily tied to the x86 LOCK prefix but a generic mechanism to support far atomics across different architectures.
The AtomicOpFunctors are supposed to allow various different atomic operations to be sent to the memory or cache controllers in a simple way, without needing to specify all the possible different atomic op types, and provide their functionality, inside of the request. I think it is a very nice mechanism for handling atomics, but ARM had previously stated their desire to remove them altogether. Is this no longer the case, Nikos?
Tony, I only wanted to point out that all the complexity in Steve's implementation was due to the split of the operation in two packets. Implementing the lock prefix with a single packet in the form of an atomic would simplify things a lot.
I took the AtomicOpFunctor as given as it is already commited. I am missing the context here as I haven't followed the discussion on the AtomicOpFunctor but if there are thoughts to change the interface I would imagine that the mechanism/functionality would be pretty much the same. As for the overall idea, I am pretty sure that we are interested in a mechanism for (far)-atomic operations in gem5.
[I thought I had replied to Nikos's earlier comment, but didn't see it in the email thread... apparently I did not click "publish" when I was done. Fortunately the draft was still in reviewboard. So the text below is actually a reply to Nikos's comment of Jan 9.]
Nikos, you are correct about how AtomicOpFunctor operates. The function is opaque to the memory system and is not tied to any particular ISA feature. My point was not about the feature itself, but that a substantial number of functor instances would have to be defined to cover all the possible instructions and data types that can take the LOCK prefix, and that modifying the x86 ISA support to use those functors in place of the current microcode definitions would also be a substantial amount of work.
- AtomicOpFunctor is apparently only defined for Ruby (I believe because it's only currently used by the AMD GPU model, which doesn't work with classic caches anyway). So to use this solution, another prerequisite would be to port AtomicOpFunctor support to the classic cache. That's probably something that should be done anyway, but it is yet more work that no one is signed up to do. There is a similar feature in the classic cache that supports only compare and swap (see Cache::cmpAdnSwap()), so I think the right approach would be to generalize this path to accept AtomicOpFunctors and use that to replace the existing compare-and-swap support.
- More critically, x86 requires atomic operations to work on accesses that cross cache-line boundaries. This works with the current patch (I believe, not sure I explicitly tested this) because the CPU will load and lock both lines into the cache, do the operation, then unlock both lines after they are updated. It's not clear to me how to make this work with AtomicOpFunctor, since the whole cache interface assumes single-cache-line accesses. (All x86 unaligned operations are managed by the CPU, not by the cache.) This could be the ultimate reason why we want to stick with the current approach rather than moving to AtomicOpFunctor.
To give some context we had a lot of resistance when we first introduced the AtomicOpFunctors because, if I am remembering correctly, it was deemed too opaque, and that it would cause issues when connecting gem5 to TLM. We had the owner of that patch here at AMD try to refactor the code to not need to rely on the AtomicOpFunctor, but this was more difficult than we originally thought and it was never completed. The AtomicOpFunctor seems like a more natural solution to the problem, however, so if there is a way we can keep them that would be preferred. Otherwise we need to add support for every possible atomic type to the request, as well as space for the additional atomic data and some specifiers to mark which type of atmoic operation the request is carryning. Not too long ago Andreas mentioned that ARM were workign on something to replace them internally as well, but I haven't heard the status of that.
This is an automatically generated e-mail. To reply, visit:
Post by Andreas Hansson
(Updated April 14, 2016, 10:42 p.m.)
Review request for Default.
mem: implement x86 locked accesses in timing-mode classic cache
Add LockedRMW(Read|Write)(Req|Resp) commands. In timing mode,
use a combination of clearing permission bits and leaving
an MSHR in place to prevent accesses & snoops from touching
a locked block between the read and write parts of an locked