Discussion:
[gem5-dev] Review Request 2691: mem: implement x86 locked accesses in timing-mode classic cache
Steve Reinhardt
2015-03-14 17:19:01 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/
-----------------------------------------------------------

Review request for Default.


Repository: gem5


Description
-------

Changeset 10745:9b84d1b570e3
---------------------------
mem: implement x86 locked accesses in timing-mode classic cache

Add LockedRMW(Read|Write)(Req|Resp) commands. In timing mode,
use a combination of clearing permission bits and leaving
an MSHR in place to prevent accesses & snoops from touching
a locked block between the read and write parts of a locked
RMW sequence.
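The mechanism in that summary can be sketched as a toy model. To be clear, `LockWindow` and its string-based deferred list are invented stand-ins for illustration, not gem5 classes; the real code uses CacheBlk permission bits and MSHR deferred targets. The read half strips the block's permissions so everything that arrives afterwards looks like a miss and queues up; the write half restores them and drains the queue.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Toy model of the lock window: 'LockWindow' is hypothetical; gem5 uses
// real CacheBlk status bits and MSHR targets instead.
struct LockWindow {
    bool readable = true;
    bool writable = true;
    std::deque<std::string> deferred;   // stands in for MSHR deferred targets

    // Read half of the RMW: clear permissions, leave the MSHR in place.
    void lockedRMWRead() { readable = writable = false; }

    // Any access or snoop during the window sees no permissions, so it
    // takes the miss path and queues up behind the in-flight MSHR.
    bool access(const std::string &who) {
        if (!readable) {
            deferred.push_back(who);
            return false;               // deferred, not serviced
        }
        return true;                    // hit, serviced immediately
    }

    // Write half of the RMW: restore permissions; the deferred targets
    // are then serviced in arrival order.
    size_t lockedRMWWrite() {
        readable = writable = true;
        size_t n = deferred.size();
        deferred.clear();
        return n;                       // number of targets now serviced
    }
};
```

The key property is that nothing touches the block between the two halves: every access or snoop that arrives inside the window is parked behind the MSHR and only serviced once the write half completes.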


Diffs
-----

src/mem/cache/cache_impl.hh 655ff3f6352d7aa4021f8840b68698b277772806
src/mem/packet.hh 655ff3f6352d7aa4021f8840b68698b277772806
src/mem/packet.cc 655ff3f6352d7aa4021f8840b68698b277772806

Diff: http://reviews.gem5.org/r/2691/diff/


Testing
-------


Thanks,

Steve Reinhardt
Andreas Hansson
2015-03-17 19:07:36 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/#review5943
-----------------------------------------------------------



src/mem/cache/cache_impl.hh
<http://reviews.gem5.org/r/2691/#comment5194>

is the changed order of the write buffer allocation and the handling of the writebacks important?


- Andreas Hansson
Steve Reinhardt
2015-03-18 22:30:08 UTC
Permalink
Post by Steve Reinhardt
src/mem/cache/cache_impl.hh, line 1417
<http://reviews.gem5.org/r/2691/diff/1/?file=44226#file44226line1417>
is the changed order of the write buffer allocation and the handling of the writebacks important?
Good question. I reordered the two blocks only so I could put the tempBlock clause inside of 'if (!early_exit)' and leave the writebacks code outside. It's necessary to do the writebacks before exiting the function, since that's a local variable and otherwise the writebacks would be lost. In contrast, tempBlock is part of the cache object so it will stick around, and while I haven't thought it through carefully, I assume there could be cases where you're doing a locked RMW in tempBlock, so you don't want to kick it out prematurely.

Looking at handleFill(), it seems unlikely (dare I say impossible) that you would have both writebacks and the use of tempBlock at the same time, since the only time you use tempBlock is if you can't use any existing cache block frame, and if you had bothered to do a writeback then by definition you would be clearing out an existing cache block frame. So by that reasoning the order shouldn't matter.
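The lifetime argument above can be made concrete with a stub. Everything here is hypothetical shorthand (a `Cache` stub with an int write buffer); the real function under discussion is in cache_impl.hh. The point is purely about lifetime: the writeback list is a local, so it must be drained before any early return, while a member like tempBlock survives the call.

```cpp
#include <cassert>
#include <vector>

// Illustrative stub, not gem5's Cache. 'writebacks' is local to the call,
// so an early exit that skipped the drain loop would silently drop them.
struct Cache {
    std::vector<int> writeBuffer;   // persists across calls (cache member)
    int tempBlockUser = -1;         // persists too, like tempBlock

    void recvResponse(bool early_exit, std::vector<int> writebacks) {
        // ... target-servicing loop would run here ...

        // Drain the local list before any exit path: once this function
        // returns, 'writebacks' is gone.
        for (int wb : writebacks)
            writeBuffer.push_back(wb);

        if (early_exit)
            return;                 // safe: writebacks already drained

        // tempBlock handling can live inside the !early_exit path because
        // the member sticks around for a later call to finish with it.
        tempBlockUser = 0;
    }
};
```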


- Steve


Andreas Hansson
2015-03-23 09:47:04 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/#review5961
-----------------------------------------------------------


Great to see this functionality added. I am not 100% convinced of the way it is done, but perhaps there is no "neater" way of doing it. In particular, the creation of new requests and packets makes me nervous. Who is deleting these?

Also, how does this interact with existing MSHRs? Targets and deferred targets? Similarly, now we fake an MSHR even if the block is present in the cache. Would it make sense to flush/downgrade somehow and then treat this like a miss?

- Andreas Hansson
Nilay Vaish
2015-04-07 23:07:01 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/#review6026
-----------------------------------------------------------



src/mem/cache/cache_impl.hh
<http://reviews.gem5.org/r/2691/#comment5249>

Spelling mistake.


- Nilay Vaish
Stephan Diestelhorst
2015-04-08 20:43:58 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/#review6028
-----------------------------------------------------------


I get the idea, but I have been burnt before when it comes to debugging request + packet deep copies.


src/mem/cache/cache_impl.hh
<http://reviews.gem5.org/r/2691/#comment5250>

Since I just spent a few good hours hunting memory allocation bugs... where will these get deallocated again? It is not obvious, so maybe chuck in a comment here?



src/mem/cache/cache_impl.hh
<http://reviews.gem5.org/r/2691/#comment5251>

Again... I'd like to see a comment on where these guys (or the originals) are deallocated.



src/mem/cache/cache_impl.hh
<http://reviews.gem5.org/r/2691/#comment5252>

Why don't you break and drop the early_exit?



src/mem/cache/cache_impl.hh
<http://reviews.gem5.org/r/2691/#comment5253>

This should really be:

if (!mshr.hasTargets())

and all the stuff below is indented one (or two) levels too deep, IMHO. I seriously consider goto an appropriate way to improve readability here.


- Stephan Diestelhorst
Steve Reinhardt
2015-05-06 22:53:19 UTC
Permalink
Post by Steve Reinhardt
src/mem/cache/cache_impl.hh, line 1359
<http://reviews.gem5.org/r/2691/diff/1/?file=44226#file44226line1359>
if (!mshr.hasTargets())
and all the stuff below is indented one (or two) levels too deep, IMHO. I seriously consider goto an appropriate way to improve readability here.
Changing the test to 'if (!mshr.hasTargets())' is equivalent (I believe) but it's not clear to me that it's better... the only reason that condition could be true would be because early_exit is false, and early_exit is the more meaningful cause, I'd say.

I could use a goto... as I mentioned above, I seriously considered it, but rejected it since it wouldn't completely replace early_exit. I'd have no problem with doing it here, but others might, so I was doing it as an if-block to avoid the controversy. I will post another version of the patch with a goto for comments on this specifically.


- Steve


Steve Reinhardt
2015-05-06 22:56:10 UTC
Permalink
Post by Steve Reinhardt
src/mem/cache/cache_impl.hh, line 1286
<http://reviews.gem5.org/r/2691/diff/1/?file=44226#file44226line1286>
Why don't you break and drop the early_exit?
[Hmm, I had a comment here but reviewboard silently dropped it when I posted.]

Basically, break doesn't work here because you still need to skip the code that comes after the if/else-if/else-if... sequence but before the break. That is, the set of lines starting with:
tgt_pkt->makeTimingResponse();


- Steve


Steve Reinhardt
2015-05-06 22:21:30 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/
-----------------------------------------------------------

(Updated May 6, 2015, 3:21 p.m.)


Review request for Default.


Changes
-------

Updated to tip of tree. Fixed memory leak and spelling error.

Note that I had to remove the recently added 'const' modifier from the pkt field of the MSHR::Target object as part of the update. As mentioned in the comment, we really need to use the existing Packet object for the response and substitute in a dummy packet, since the former holds the allocated data pointer for the response data. I tried the option of creating a new packet for the response, but there's no clean way to move a dynamically allocated data pointer from an existing packet to a new one. We could add methods to Packet to do this, but since it's all just to avoid removing 'const' from the pkt field, it seemed easier just to remove 'const'. If anyone has a better idea on how to handle this, I'm listening.


Repository: gem5


Description
-------

Changeset 10745:9b84d1b570e3
---------------------------
mem: implement x86 locked accesses in timing-mode classic cache

Add LockedRMW(Read|Write)(Req|Resp) commands. In timing mode,
use a combination of clearing permission bits and leaving
an MSHR in place to prevent accesses & snoops from touching
a locked block between the read and write parts of a locked
RMW sequence.


Diffs (updated)
-----

src/mem/cache/cache_impl.hh fbdaa08aaa426b9f4660c366f934ccb670d954ec
src/mem/cache/mshr.hh fbdaa08aaa426b9f4660c366f934ccb670d954ec
src/mem/packet.hh fbdaa08aaa426b9f4660c366f934ccb670d954ec
src/mem/packet.cc fbdaa08aaa426b9f4660c366f934ccb670d954ec

Diff: http://reviews.gem5.org/r/2691/diff/


Testing
-------


Thanks,

Steve Reinhardt
Steve Reinhardt
2015-05-06 22:59:02 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/
-----------------------------------------------------------

(Updated May 6, 2015, 3:59 p.m.)


Review request for Default.


Changes
-------

This is an alternative version of the patch that uses a goto to narrow the scope of the early_exit flag and reduce indentation levels. I'm not necessarily replacing the prior patch with this one; I really just want to post two alternatives, and this seemed easier than creating a whole new patch to review. You probably want to focus just on the diff between this revision and the previous one.


Repository: gem5


Description
-------

Changeset 10745:9b84d1b570e3
---------------------------
mem: implement x86 locked accesses in timing-mode classic cache

Add LockedRMW(Read|Write)(Req|Resp) commands. In timing mode,
use a combination of clearing permission bits and leaving
an MSHR in place to prevent accesses & snoops from touching
a locked block between the read and write parts of a locked
RMW sequence.


Diffs (updated)
-----

src/mem/cache/cache_impl.hh fbdaa08aaa426b9f4660c366f934ccb670d954ec
src/mem/cache/mshr.hh fbdaa08aaa426b9f4660c366f934ccb670d954ec
src/mem/packet.hh fbdaa08aaa426b9f4660c366f934ccb670d954ec
src/mem/packet.cc fbdaa08aaa426b9f4660c366f934ccb670d954ec

Diff: http://reviews.gem5.org/r/2691/diff/


Testing
-------


Thanks,

Steve Reinhardt
Andreas Hansson
2015-05-07 07:55:00 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/#review6124
-----------------------------------------------------------


I vote for the non-goto version.

Is there really no way we can achieve this without the extra packets going back and forth? It seems the short-circuiting we do with packets could be done without them if we tapped in one level below recvTimingReq/recvTimingResp.


src/mem/cache/cache_impl.hh (line 527)
<http://reviews.gem5.org/r/2691/#comment5327>

later on we test pkt->isLockedRMW and then read or write. Could we not do the same here instead and avoid looking at the actual command, i.e.

pkt->isLockedRMW() && pkt->isWrite()



src/mem/cache/cache_impl.hh (line 621)
<http://reviews.gem5.org/r/2691/#comment5328>

should the status even be 0, or could it still have the dirty flag set?

the readable flag is essentially valid, so here we invent a new state

Readable = 0
Dirty = ?
Writeable = 0

Still the block is "valid".

This scares me a bit. Partly because the binary encoding of cache states is not that easy to follow, and partly because we now have MOESI + some new state.



src/mem/cache/cache_impl.hh (line 638)
<http://reviews.gem5.org/r/2691/#comment5329>

Why is this MSHR needed? It was faked, was it not?



src/mem/cache/cache_impl.hh (line 639)
<http://reviews.gem5.org/r/2691/#comment5330>

Is it safe to call this here and ignore the return value?

It would be nice if we do not assume that the response is always accepted.

Perhaps schedule an event and put it in a queue rather?



src/mem/cache/cache_impl.hh (line 1296)
<http://reviews.gem5.org/r/2691/#comment5331>

potentially still dirty, or always came from "Exclusive", i.e. Readable | Writeable?



src/mem/packet.hh (line 107)
<http://reviews.gem5.org/r/2691/#comment5326>

Do we need the actual LockedRMW... or could it just be a flag on the existing read and write req/resp?


- Andreas Hansson
Steve Reinhardt
2015-05-07 15:34:12 UTC
Permalink
Post by Steve Reinhardt
Post by Andreas Hansson
I vote for the non-goto version.
Is there really no way we can achieve this without the extra packets going back and forth? It seems the shortcircuiting we do with packets we should be able to do without them if we tapped in one level below the recvTimingReq/recvTimingResp.
Thanks for the review! I'm not sure what you mean by "one level below" though.

There are lots of ways this can be done; I spent a lot of time thinking about how to do this before I started on it. The key is that you have to check whether a block is locked when snoops or other requests come in, defer those snoops and other accesses in some kind of list during the RMW window, then process that list of deferred snoops/requests when the RMW completes.

Given that we already have this entire mechanism in place with the MSHRs, it seems obvious to me that we should leverage it. Faking the packets is a little weird but allows us to make the most use of existing code paths and minimizes changes. This is less than 60 net lines of non-comment code added to cache_impl.hh, which I think is pretty good for a feature that I initially expected would be much more complex.

I'm sure there are ways we could do this without the fake packets, at the cost of some significant restructuring (e.g., taking the 'while (mshr->hasTargets())' loop out of recvTimingResp() and making it callable from multiple places) and probably more special-casing in the existing code paths. For example, we could have a flag in MSHR::Target that indicates that the block is locked, and leave the packet pointer null, but then we'd probably end up adding null pointer checks in several places. That was more than I really wanted to bite off though. I suppose I could give it a shot but it might be a while until I have time.
Post by Steve Reinhardt
Post by Andreas Hansson
src/mem/cache/cache_impl.hh, line 527
<http://reviews.gem5.org/r/2691/diff/3/?file=44871#file44871line527>
later on we test pkt->isLockedRMW and then read or write. Could we not do the same here instead and avoid looking at the actual command, i.e.
pkt->isLockedRMW() && pkt->isWrite()
Sure.
Post by Steve Reinhardt
Post by Andreas Hansson
src/mem/cache/cache_impl.hh, line 621
<http://reviews.gem5.org/r/2691/diff/3/?file=44871#file44871line621>
status should be 0 even, or could it still have the dirty flag set?
the readable flag is essentially valid, so here we invent a new state
Readable = 0
Dirty = ?
Writeable = 0
Still the block is "valid".
This scares me a bit. Partly because the binary encoding of cache states is not that easy to follow, and partly because we now have MOESI + some new state.
It could still have the dirty bit set, though practically speaking that's moot, since it will be set by the write part of the RMW regardless. There are other bits like BlkHWPrefetched and BlkSecure that we probably don't want to mess with though.

It is obviously not a MOESI state, but that's OK, because coherence protocol operations will never observe this state, as all snoops are also being deferred. Its only purpose is to make all CPU-side accesses (read or write) go down the miss path, where we can defer them until the RMW completes. There is precedent for this; we also clear the Readable bit on valid blocks in the window after a write miss (while the upgrade request is outstanding); see the comment around line 788 of this file.
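The "not a MOESI state" point can be illustrated with a small bit-flag sketch. The flag names echo the discussion, but the values and the two helpers are invented here; gem5's actual CacheBlk encodings differ. The block stays valid (and possibly dirty) while the permission bits are cleared for the duration of the lock.

```cpp
#include <cassert>
#include <cstdint>

// Flag names modeled on the discussion; actual gem5 encodings differ.
enum Status : uint8_t {
    BlkValid    = 1 << 0,
    BlkReadable = 1 << 1,
    BlkWritable = 1 << 2,
    BlkDirty    = 1 << 3,
};

// Entering the lock window: keep valid (and dirty, if set) but clear the
// permission bits so every CPU-side access takes the miss path and gets
// deferred behind the MSHR.
inline uint8_t enterLockedRMW(uint8_t status) {
    return status & ~(BlkReadable | BlkWritable);
}

// Leaving the window: restore permissions. Dirty is set by the write half
// of the RMW anyway, so its value during the window is moot.
inline uint8_t exitLockedRMW(uint8_t status) {
    return status | BlkReadable | BlkWritable | BlkDirty;
}
```

Because snoops are deferred too, no coherence-protocol operation ever observes the intermediate "valid but unreadable" combination; it exists only to steer CPU-side accesses down the miss path.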
Post by Steve Reinhardt
Post by Andreas Hansson
src/mem/cache/cache_impl.hh, line 638
<http://reviews.gem5.org/r/2691/diff/3/?file=44871#file44871line638>
Why is this MSHR needed? It was faked, was it not?
Not necessarily... if the read part of the RMW was an actual cache miss, this is still the non-fake MSHR that was allocated at that time. The initial target of that MSHR will have a fake packet at this point, but there could be real deferred targets waiting behind it.
Post by Steve Reinhardt
Post by Andreas Hansson
src/mem/cache/cache_impl.hh, line 639
<http://reviews.gem5.org/r/2691/diff/3/?file=44871#file44871line639>
Is it safe to call this here and ignore the return value?
It would be nice if we do not assume that the response is always accepted.
Perhaps schedule an event and put it in a queue rather?
recvTimingResp() doesn't have a return value... responses are always accepted.
Post by Steve Reinhardt
Post by Andreas Hansson
src/mem/cache/cache_impl.hh, line 1296
<http://reviews.gem5.org/r/2691/diff/3/?file=44871#file44871line1296>
potentially still dirty, or always came from "Exclusive", i.e. Readable | Writeable?
BlkDirty may have been set in handleFill()
Post by Steve Reinhardt
Post by Andreas Hansson
src/mem/packet.hh, line 107
<http://reviews.gem5.org/r/2691/diff/3/?file=44873#file44873line107>
Do we need the actual LockedRMW... or could it just be a flag on the existing read and write req/resp?
The way MemCmd works, you need a new command for every combination of flags (it's a dense encoding not a sparse encoding, so you look up the flags via the commandInfo[] array). Confusing since it's different from how Request flags are handled, I know.
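A sketch of what such a dense encoding looks like follows. The table contents and the helper are invented to mirror the description, not gem5's actual MemCmd, but they show why a "locked RMW" attribute necessarily means new commands: each command is an index into a static commandInfo[] array, so a new combination of attributes needs its own entry.

```cpp
#include <cassert>

// Dense command encoding sketch: invented table, real idea. Each command
// indexes a static attribute record, so a new combination of attributes
// (e.g. "read" + "locked RMW") needs its own command.
enum Command {
    ReadReq, ReadResp, WriteReq, WriteResp,
    LockedRMWReadReq, LockedRMWReadResp,
    LockedRMWWriteReq, LockedRMWWriteResp,
    NUM_COMMANDS
};

enum Attr { IsRead, IsWrite, IsLockedRMW, IsRequest, IsResponse };

struct CommandInfo {
    unsigned attrs;     // bitmask over Attr
    Command response;   // matching response command
};

static const CommandInfo commandInfo[NUM_COMMANDS] = {
    { 1u << IsRead  | 1u << IsRequest,                      ReadResp },
    { 1u << IsRead  | 1u << IsResponse,                     ReadResp },
    { 1u << IsWrite | 1u << IsRequest,                      WriteResp },
    { 1u << IsWrite | 1u << IsResponse,                     WriteResp },
    { 1u << IsRead  | 1u << IsRequest  | 1u << IsLockedRMW, LockedRMWReadResp },
    { 1u << IsRead  | 1u << IsResponse | 1u << IsLockedRMW, LockedRMWReadResp },
    { 1u << IsWrite | 1u << IsRequest  | 1u << IsLockedRMW, LockedRMWWriteResp },
    { 1u << IsWrite | 1u << IsResponse | 1u << IsLockedRMW, LockedRMWWriteResp },
};

inline bool isLockedRMW(Command c) {
    return commandInfo[c].attrs & (1u << IsLockedRMW);
}
```

This is the dense/sparse distinction in the reply: Request flags are an orthogonal bitmask on the request, while command attributes are baked into the per-command table.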


- Steve


Andreas Hansson
2016-01-01 14:57:09 UTC
Permalink
Post by Steve Reinhardt
Post by Andreas Hansson
src/mem/cache/cache_impl.hh, line 639
<http://reviews.gem5.org/r/2691/diff/3/?file=44871#file44871line639>
Is it safe to call this here and ignore the return value?
It would be nice if we do not assume that the response is always accepted.
Perhaps schedule an event and put it in a queue rather?
recvTimingResp() doesn't have a return value... responses are always accepted.
Actually, responses are not always accepted directly. They are always guaranteed to be sunk without dependencies, but there are modules that use flow control on responses (simply for rate-regulation reasons).


- Andreas


Steve Reinhardt
2016-01-19 03:46:13 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/
-----------------------------------------------------------

(Updated Jan. 18, 2016, 7:46 p.m.)


Review request for Default.


Changes
-------

This revision doesn't address any reviewer concerns, but it does update the old implementation to work with the latest tree (for those who are still downloading this manually). It may also fix a bug; there was a bug in one of our internal versions but I'm not sure if that bug made it out onto reviewboard or not.


Repository: gem5


Description
-------

Changeset 10745:9b84d1b570e3
---------------------------
mem: implement x86 locked accesses in timing-mode classic cache

Add LockedRMW(Read|Write)(Req|Resp) commands. In timing mode,
use a combination of clearing permission bits and leaving
an MSHR in place to prevent accesses & snoops from touching
a locked block between the read and write parts of a locked
RMW sequence.


Diffs (updated)
-----

src/mem/packet.cc d06e5a6b4b7f05a642c3e2bee12cfeb130dede16
src/mem/cache/cache.cc d06e5a6b4b7f05a642c3e2bee12cfeb130dede16
src/mem/cache/mshr.hh d06e5a6b4b7f05a642c3e2bee12cfeb130dede16
src/mem/packet.hh d06e5a6b4b7f05a642c3e2bee12cfeb130dede16

Diff: http://reviews.gem5.org/r/2691/diff/


Testing
-------


Thanks,

Steve Reinhardt
Bjoern A. Zeeb
2016-04-13 14:38:25 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/#review8197
-----------------------------------------------------------


Can you please update it?
It neither applies cleanly nor does it compile afterwards.

- Bjoern A. Zeeb
Steve Reinhardt
2016-04-15 05:43:02 UTC
Permalink
Post by Steve Reinhardt
Post by Bjoern A. Zeeb
Can you please update it?
It neither applies cleanly nor does it compile afterwards.
Done.


- Steve


Steve Reinhardt
2016-04-15 05:42:04 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/
-----------------------------------------------------------

(Updated April 14, 2016, 10:42 p.m.)


Review request for Default.


Changes
-------

Updated again to work with latest head (rev 11443:df24b9af42c7)


Repository: gem5


Description (updated)
-------

Changeset 11444:8a1419dbbfa6
---------------------------
mem: implement x86 locked accesses in timing-mode classic cache

Add LockedRMW(Read|Write)(Req|Resp) commands. In timing mode,
use a combination of clearing permission bits and leaving
an MSHR in place to prevent accesses & snoops from touching
a locked block between the read and write parts of a locked
RMW sequence.


Diffs (updated)
-----

src/mem/cache/cache.cc df24b9af42c72606f1fa8e5aa0502b53e81ea176
src/mem/cache/mshr.hh df24b9af42c72606f1fa8e5aa0502b53e81ea176
src/mem/packet.hh df24b9af42c72606f1fa8e5aa0502b53e81ea176
src/mem/packet.cc df24b9af42c72606f1fa8e5aa0502b53e81ea176

Diff: http://reviews.gem5.org/r/2691/diff/


Testing
-------


Thanks,

Steve Reinhardt
Tony Gutierrez
2017-01-05 16:32:50 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/2691/#review9228
-----------------------------------------------------------


Is there anything holding this up from being shipped?

- Tony Gutierrez
Andreas Hansson
2017-01-05 16:43:50 UTC
Permalink
Post by Steve Reinhardt
Post by Tony Gutierrez
Is there anything holding this up from being shipped?
In my view the patch needs two things:

1) Some thought around the design. I am still hoping there is a less invasive way of accommodating the functionality, possibly with some changes to the existing functions of the cache.

2) A way to test it. This should preferably include both synthetic and real use-cases. The memtester and memchecker may be a good starting point for the synthetic part.

It would be great if someone could dig into these issues.


- Andreas


Steve Reinhardt
2017-01-05 19:26:12 UTC
Permalink
Post by Andreas Hansson
Post by Tony Gutierrez
Is there anything holding this up from being shipped?
1) Some thought around the design. I am still hoping there is a less invasive way of accommodating the functionality, possibly with some changes to the existing functions of the cache.
2) A way to test it. This should preferably include both synthetic and real use-cases. The memtester and memchecker may be a good starting point for the synthetic part.
It would be great if someone could dig into these issues.
It's been quite a while since I've looked at this... in fact I was going to say that there might be a memory leak, but looking back at the patch history I see I fixed that already in May 2015 :).

It's pretty easy to come up with a simple test case, just using C++11 atomic ops or pthread mutexes. Unfortunately the ones I had were on my AMD work system that I had to give back, and it looks like I missed them when I went to copy off useful non-proprietary stuff before I left. I know I didn't spend much time writing tests though, so they should be easy to recreate.

As far as being less invasive, I think my reply to Andreas's comment of May 7, 2015 is still relevant: given the current structure of the cache code, I think the existing solution is really not that invasive. There isn't really that much code being added here; most of it is comments. Generating a synthetic request to allow allocating an MSHR so that we can accumulate the deferred operations that happen while the lock is held is a little strange, but it allows us to reuse the bulk of the code in recvTimingResp() with just a few changes.

If the cache code were heavily restructured, we could probably layer it in such a way that the fake packets would not be necessary, and thus the invasiveness of the locked RMW support could be reduced. However, I started down this path a long time ago (shortly after Andreas's comment), and after putting a fair amount of work into the restructuring, I still didn't even have the code working at all, so I gave up. Thus I think the net invasiveness of restructuring the cache to reduce the amount of change required specifically for this feature is much, much larger than the invasiveness of this patch by itself. The end result would probably be better code, but it's a huge amount of work that I don't see anyone signing up for right now, so I think it's better to put this change in as-is for now, and put "restructuring the cache" on the to-do list.


- Steve


Nikos Nikoleris
2017-01-06 15:31:46 UTC
Permalink
I agree, the implementation of locked instructions wouldn't be trivial with the current structure of the cache. It would be nice if we didn't have to abuse the MSHR in that way; it makes it hard to reason about memory leaks, and it can't solve the problem for atomic and functional accesses.

Wouldn't it be easier to implement this in the same way as far atomics? If I understand correctly the semantics of locked instructions, it would be significantly simpler if we could perform these operations with a single RMW packet rather than two separate, a read and then a write. Wouldn't the AtomicOpFunctor in the packet class be sufficient?

Tony would that work for you?


- Nikos


Steve Reinhardt
2017-01-06 18:07:58 UTC
Permalink
Thanks for the input, Nikos. While it would be a lot simpler to do the atomic ops in the cache using functors, x86 lets you put a LOCK prefix in front of a fairly long list of instructions, and then you have to combine that with all the different data sizes that each instruction supports. Plus it would be a pretty big disruption to the whole CPU ISA definition to create and plumb those packets through to the memory interface, and it would make the locked instructions completely different from their unlocked counterparts. These are all the reasons we didn't do it this way to begin with, though in retrospect it's not as infeasible as perhaps it seemed up front; just complicated and a little ugly. Not necessarily worse than the current patch.

Actually though there is no problem with atomic and functional accesses... it's easy to make them atomic, as you just issue the read and the write within the same event tick/process() method (i.e., without going back to the event queue). It's only timing accesses that are hard with this interface.


- Steve


Nikos Nikoleris
2017-01-09 10:20:20 UTC
Permalink
I thought that the AtomicOpFunctor solves exactly this issue. While decoding the instruction, you pass a function pointer which operates on the data in the memory system. Does the memory system need to be aware of the specifics of the operation? If I understand correctly, the cache will only call the function pointer and won't need to decode the instruction or know anything about it for that matter. I am not really familiar with the LOCK prefix, but do we need to support anything more than arithmetic operations?

It would be great if this was not necessarily tied to the x86 LOCK prefix but a generic mechanism to support far atomics across different architectures.


- Nikos


Tony Gutierrez
2017-01-16 18:03:44 UTC
Permalink
The AtomicOpFunctors are supposed to allow various different atomic operations to be sent to the memory or cache controllers in a simple way, without needing to enumerate all the possible atomic op types, and their functionality, inside the request. I think it is a very nice mechanism for handling atomics, but ARM had previously stated their desire to remove them altogether. Is this no longer the case, Nikos?


- Tony


Nikos Nikoleris
2017-01-16 22:02:54 UTC
Permalink
I agree, the implementation of locked instructions wouldn't be trivial with the current structure of the cache. It would be nice if we didn't have to abuse the MSHR in that way, it makes it hard to reason about memory leaks, and it can't solve the problem for atomic and functional accesses.
Wouldn't it be easier to implement this in the same way as far atomics? If I understand correctly the semantics of locked instructions, it would be significantly simpler if we could perform these operations with a single RMW packet rather than two separate, a read and then a write. Wouldn't the AtomicOpFunctor in the packet class be sufficient?
Tony would that work for you?
Thanks for the input, Nikos. While it would be a lot simpler to do the atomic ops in the cache using functors, x86 lets you put a LOCK prefix in front of a fairly long list of instructions, and then you have to combine that with all the different data sizes that each instruction supports. Plus it would be a pretty big disruption to the whole CPU ISA definition to create and plumb those packets through to the memory interface, and it would make the locked instructions completely different from their unlocked counterparts. These are all the reasons we didn't do it this way to begin with, though in retrospect it's not as infeasible as perhaps it seemed up front; just complicated and a little ugly. Not necessarily worse than the current patch.
Actually though there is no problem with atomic and functional accesses... it's easy to make them atomic, as you just issue the read and the write within the same event tick/process() method (i.e., without going back to the event queue). It's only timing accesses that are hard with this interface.
I thought that the AtomicOpFunctor solves exactly this issue. While decoding the instruction you pass a function pointer which operates on the data in the memory system. Does the memory system need to be aware of the specifics of the operation? If I understand correctly the cache will only call the function pointer and it won't need to decode the instruction or know anything about it for that matter. I am not really familiar with the LOCK prefix but do we need to support something more than arithmetic operations?
It would be great if this was not necessarily tied to the x86 LOCK prefix but a generic mechanism to support far atomics across different architectures.
The AtomicOpFunctors are supposed to allow various different atomic operations to be sent to the memory or cache controllers in a simple way, without needing to specify all the possible different atomic op types, and provide their functionality, inside of the request. I think it is a very nice mechanism for handling atomics, but ARM had previously stated their desire to remove them altogether. Is this no longer the case, Nikos?
Tony, I only wanted to point out that all the complexity in Steve's implementation was due to the split of the operation into two packets. Implementing the lock prefix with a single packet in the form of an atomic would simplify things a lot.

I took the AtomicOpFunctor as given as it is already committed. I am missing the context here as I haven't followed the discussion on the AtomicOpFunctor, but if there are thoughts to change the interface I would imagine that the mechanism/functionality would be pretty much the same. As for the overall idea, I am pretty sure that we are interested in a mechanism for (far-)atomic operations in gem5.


- Nikos


Steve Reinhardt
2017-01-17 00:05:07 UTC
Permalink
[I thought I had replied to Nikos's earlier comment, but didn't see it in the email thread... apparently I did not click "publish" when I was done. Fortunately the draft was still in reviewboard. So the text below is actually a reply to Nikos's comment of Jan 9.]

Nikos, you are correct about how AtomicOpFunctor operates. The function is opaque to the memory system and is not tied to any particular ISA feature. My point was not about the feature itself, but that a substantial number of functor instances would have to be defined to cover all the possible instructions and data types that can take the LOCK prefix, and that modifying the x86 ISA support to use those functors in place of the current microcode definitions would also be a substantial amount of work.

Two other things I have thought of/noticed since my previous comment:
- AtomicOpFunctor is apparently only defined for Ruby (I believe because it's only currently used by the AMD GPU model, which doesn't work with classic caches anyway). So to use this solution, another prerequisite would be to port AtomicOpFunctor support to the classic cache. That's probably something that should be done anyway, but it is yet more work that no one is signed up to do. There is a similar feature in the classic cache that supports only compare and swap (see Cache::cmpAndSwap()), so I think the right approach would be to generalize this path to accept AtomicOpFunctors and use that to replace the existing compare-and-swap support.
- More critically, x86 requires atomic operations to work on accesses that cross cache-line boundaries. This works with the current patch (I believe, not sure I explicitly tested this) because the CPU will load and lock both lines into the cache, do the operation, then unlock both lines after they are updated. It's not clear to me how to make this work with AtomicOpFunctor, since the whole cache interface assumes single-cache-line accesses. (All x86 unaligned operations are managed by the CPU, not by the cache.) This could be the ultimate reason why we want to stick with the current approach rather than moving to AtomicOpFunctor.


- Steve


Tony Gutierrez
2017-01-18 16:42:58 UTC
Permalink
Hi Nikos,

To give some context, we had a lot of resistance when we first introduced the AtomicOpFunctors because, if I am remembering correctly, it was deemed too opaque, and it would cause issues when connecting gem5 to TLM. We had the owner of that patch here at AMD try to refactor the code to not rely on the AtomicOpFunctor, but this was more difficult than we originally thought and it was never completed. The AtomicOpFunctor seems like a more natural solution to the problem, however, so if there is a way we can keep them that would be preferred. Otherwise we need to add support for every possible atomic type to the request, as well as space for the additional atomic data and some specifiers to mark which type of atomic operation the request is carrying. Not too long ago Andreas mentioned that ARM were working on something to replace them internally as well, but I haven't heard the status of that.


- Tony

