Discussion:
[gem5-dev] Review Request: O3CPU: Revive cachePorts per-cycle dcache access limit
Erik Tomusk
2013-05-17 11:03:12 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1872/
-----------------------------------------------------------

Review request for Default.


Description
-------

Changeset 9722:b5b229a8d919
---------------------------
O3CPU: Revive cachePorts per-cycle dcache access limit
This is a stop-gap patch to place a limit on the number of dcache requests the
LSQUnit sends each cycle. Currently, the LSQUnit will send any number of
requests, leading to unrealistic dcache usage. Note that there is an LSQUnit
for each hardware thread, so the cachePorts limit is enforced on a per-thread
basis.
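
For anyone skimming without the diff open, the mechanism is essentially a
per-cycle, per-thread budget. A rough sketch (cachePorts is the real
parameter and the counter really is reset in tick(), per the discussion
below; the names usedPorts/dcachePortAvailable are illustrative, not
quoted from the diff):

    // Refresh the per-cycle budget at the start of every cycle.
    template <class Impl>
    void
    LSQUnit<Impl>::tick()
    {
        usedPorts = 0;
    }

    // Checked before each dcache request this LSQUnit issues;
    // a successful send then does ++usedPorts.
    template <class Impl>
    bool
    LSQUnit<Impl>::dcachePortAvailable() const
    {
        return usedPorts < cachePorts;
    }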

What this patch does NOT do:
*Limit icache accesses
*Limit dcache accesses from sources other than the LSQUnit (e.g. accesses from
L2)

I'd like to refactor the second half of LSQUnit<Impl>::read(), as it's very
messy. It would be helpful to get feedback on whether what it does is
functionally correct before I do.

It would also be helpful if someone who understands split memory accesses
could check if that bit of code is correct, since I don't know how to test
it.


Diffs
-----

src/cpu/o3/O3CPU.py eb075b2b925a
src/cpu/o3/iew_impl.hh eb075b2b925a
src/cpu/o3/lsq_unit.hh eb075b2b925a
src/cpu/o3/lsq_unit_impl.hh eb075b2b925a

Diff: http://reviews.gem5.org/r/1872/diff/


Testing
-------

When cachePorts is set to 200 (the old value), this patch passes
ARM/tests/fast/long with the exception that the regression complains about
the new statistic.


Thanks,

Erik Tomusk
Ali Saidi
2013-05-24 03:10:57 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1872/#review4350
-----------------------------------------------------------


Thanks for this Erik,

My only issue with the way this is handled is that if the CPU runs out of cache ports, I'm pretty sure it will squash the entire pipeline and start re-fetching, which doesn't seem right to me. Anyone else?

Thanks,
Ali



src/cpu/o3/lsq_unit_impl.hh
<http://reviews.gem5.org/r/1872/#comment4095>

Can this just be made unsigned so < 0 isn't possible?


- Ali Saidi
Erik Tomusk
2013-06-05 13:13:43 UTC
Permalink
Post by Ali Saidi
Thanks for this Erik,
My only issue with the way this is handled is that if the CPU runs out of cache ports, I'm pretty sure it will squash the entire pipeline and start re-fetching, which doesn't seem right to me. Anyone else?
Thanks,
Ali
Looking at a trace, I think you're right--when the cache blocks, DefaultIEW<Impl>::executeInsts() calls squashDueToMemBlocked(). I don't quite understand why instructions should get squashed on a blocked cache, but I think it makes sense to do the same thing for both a blocked cache and when the ports limit is reached.


- Erik


Steve Reinhardt
2013-06-06 15:09:24 UTC
Permalink
Post by Erik Tomusk
Post by Ali Saidi
Thanks for this Erik,
My only issue with the way this is handled is that if the CPU runs out of cache ports, I'm pretty sure it will squash the entire pipeline and start re-fetching, which doesn't seem right to me. Anyone else?
Thanks,
Ali
Looking at a trace, I think you're right--when the cache blocks, DefaultIEW<Impl>::executeInsts() calls squashDueToMemBlocked(). I don't quite understand why instructions should get squashed on a blocked cache, but I think it makes sense to do the same thing for both a blocked cache and when the ports limit is reached.
I haven't looked at the code in detail, but it seems like there are two different situations here:
1. When the LSQ knows ahead of time that it can't send any (more) accesses to the dcache
2. When the LSQ finds out after sending an access to the dcache that the access won't succeed

In case #1, there's no reason for any squashing: you just don't send the access to the cache in the first place. This should apply both to the port limits (if you have 2 ports, just send 2 accesses, don't try to send a 3rd) and to the case where the cache is already blocked at the beginning of a cycle and you're waiting for a retry.

Case #2 is the only one where you have a squash, I think, and I believe the only scenario where it applies is when you first find out that the cache is blocked... since you find out by trying to send an access and getting a 'false' response.
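
To make the distinction concrete, a sketch (sendTimingReq is the real port
interface and squashDueToMemBlocked the real IEW call mentioned earlier;
the other names are hypothetical):

    // Case 1: known up front -- the per-cycle port budget is spent,
    // or the cache is still blocked awaiting a retry. Don't issue;
    // no squash is needed.
    if (usedPorts >= cachePorts || cacheIsBlocked) {
        deferUntilNextCycle(inst);
    } else if (!dcachePort->sendTimingReq(pkt)) {
        // Case 2: the blockage is only discovered when the send
        // fails. This is the one place a squash is arguably needed.
        cacheIsBlocked = true;
        squashDueToMemBlocked(inst, tid);
    }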


- Steve


Erik Tomusk
2013-06-05 12:48:58 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1872/
-----------------------------------------------------------

(Updated June 5, 2013, 5:48 a.m.)


Review request for Default.


Description (updated)
-------

Changeset 9722:7026fe0f45b4
---------------------------
O3CPU: Revive cachePorts per-cycle dcache access limit
This is a stop-gap patch to place a limit on the number of dcache requests the
LSQUnit sends each cycle. Currently, the LSQUnit will send any number of
requests, leading to unrealistic dcache usage. Note that there is an LSQUnit
for each hardware thread, so the cachePorts limit is enforced on a per-thread
basis.

What this patch does NOT do:
*Limit icache accesses
*Limit dcache accesses from sources other than the LSQUnit (e.g. accesses from
L2)

I'd like to refactor the second half of LSQUnit<Impl>::read(), as it's very
messy. It would be helpful to get feedback on whether what it does is
functionally correct before I do.


Diffs (updated)
-----

src/cpu/o3/O3CPU.py eb075b2b925a
src/cpu/o3/iew_impl.hh eb075b2b925a
src/cpu/o3/lsq_unit.hh eb075b2b925a
src/cpu/o3/lsq_unit_impl.hh eb075b2b925a

Diff: http://reviews.gem5.org/r/1872/diff/


Testing
-------

When cachePorts is set to 200 (the old value), this patch passes
ARM/tests/fast/long with the exception that the regression complains about
the new statistic.


Thanks,

Erik Tomusk
Erik Tomusk
2013-06-05 13:13:49 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1872/
-----------------------------------------------------------

(Updated June 5, 2013, 6:13 a.m.)


Review request for Default.


Description (updated)
-------

Changeset 9722:7026fe0f45b4
---------------------------
O3CPU: Revive cachePorts per-cycle dcache access limit
This is a stop-gap patch to place a limit on the number of dcache requests the
LSQUnit sends each cycle. Currently, the LSQUnit will send any number of
requests, leading to unrealistic dcache usage. Note that there is an LSQUnit
for each hardware thread, so the cachePorts limit is enforced on a per-thread
basis.

What this patch does NOT do:
*Limit icache accesses
*Limit dcache accesses from sources other than the LSQUnit (e.g. accesses from
L2)

I'd like to refactor the second half of LSQUnit<Impl>::read(), as it's very
messy. It would be helpful to get feedback on whether what it does is
functionally correct before I do.

It would also be helpful if someone who understands split memory accesses
could check if that bit of code is correct, since I don't know how to test
it.


Diffs
-----

src/cpu/o3/O3CPU.py eb075b2b925a
src/cpu/o3/iew_impl.hh eb075b2b925a
src/cpu/o3/lsq_unit.hh eb075b2b925a
src/cpu/o3/lsq_unit_impl.hh eb075b2b925a

Diff: http://reviews.gem5.org/r/1872/diff/


Testing
-------

When cachePorts is set to 200 (the old value), this patch passes
ARM/tests/fast/long with the exception that the regression complains about
the new statistic.


Thanks,

Erik Tomusk
Steve Reinhardt
2013-06-06 16:49:24 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1872/#review4395
-----------------------------------------------------------



src/cpu/o3/lsq_unit.hh
<http://reviews.gem5.org/r/1872/#comment4120>

This implies that both halves of an unaligned access have to go on the same cycle... is my interpretation accurate, and if so, does this restriction make sense? And if it does, why don't we check up front that we have two free ports before sending the first half?

Offhand, I don't see any reason for this restriction. I can't find anything in the official docs, but some googling indicates that unaligned memory accesses are not guaranteed to be atomic, and sending both halves in the same cycle doesn't guarantee atomicity anyway (maybe if both halves are in the same cache line, but definitely not otherwise).
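
If the same-cycle restriction were kept, the up-front check could be as
simple as the following sketch (sreqLow is the low-half request already
used in LSQUnit<Impl>::read(); usedPorts is assumed as elsewhere in this
review):

    // Before sending the first half of a split access, require two
    // free ports so the second half cannot be stranded mid-cycle.
    bool canSendSplit = !(TheISA::HasUnalignedMemAcc && sreqLow) ||
                        (cachePorts - usedPorts >= 2);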


- Steve Reinhardt
Andreas Hansson
2013-06-06 16:56:19 UTC
Permalink
Post by Steve Reinhardt
src/cpu/o3/lsq_unit.hh, line 863
<http://reviews.gem5.org/r/1872/diff/2/?file=35778#file35778line863>
This implies that both halves of an unaligned access have to go on the same cycle... is my interpretation accurate, and if so, does this restriction make sense? And if it does, why don't we check up front that we have two free ports before sending the first half?
Offhand, I don't see any reason for this restriction. I can't find anything in the official docs, but some googling indicates that unaligned memory accesses are not guaranteed to be atomic, and sending both halves in the same cycle doesn't guarantee atomicity anyway (maybe if both halves are in the same cache line, but definitely not otherwise).
It would in fact be much better if we sent it as two separate requests (from the memory system point of view). I am strongly in favour of not attempting to send two things in 0 time.


- Andreas


Steve Reinhardt
2013-06-06 17:23:35 UTC
Permalink
Post by Andreas Hansson
src/cpu/o3/lsq_unit.hh, line 863
<http://reviews.gem5.org/r/1872/diff/2/?file=35778#file35778line863>
This implies that both halves of an unaligned access have to go on the same cycle... is my interpretation accurate, and if so, does this restriction make sense? And if it does, why don't we check up front that we have two free ports before sending the first half?
Offhand, I don't see any reason for this restriction. I can't find anything in the official docs, but some googling indicates that unaligned memory accesses are not guaranteed to be atomic, and sending both halves in the same cycle doesn't guarantee atomicity anyway (maybe if both halves are in the same cache line, but definitely not otherwise).
It would in fact be much better if we sent it as two separate requests (from the memory system point of view). I am strongly in favour of not attempting to send two things in 0 time.
It is sent as two separate requests; I don't believe the memory system can tell that they're part of a single unaligned access. I agree that this is good.

My interpretation of this code is that it constrains these two requests to be sent in the same cycle; that is at least rational in that they go over separate cache ports rather than through one port in zero time. The constraint still doesn't make sense to me, though, particularly since it implies that an x86 core with only one cache port would probably deadlock if it issued an unaligned access.


- Steve


Ali Saidi
2013-06-06 18:19:49 UTC
Permalink
Post by Steve Reinhardt
It is sent as two separate requests; I don't believe the memory system can tell that they're part of a single unaligned access. I agree that this is good.
My interpretation of this code is that it constrains these two requests to be sent in the same cycle; that is at least rational in that they go over separate cache ports rather than through one port in zero time. The constraint still doesn't make sense to me, though, particularly since it implies that an x86 core with only one cache port would probably deadlock if it issued an unaligned access.
I agree too. I think the reason the code is this way is that unaligned support was hacked in. My impression is that most of the reason for the problem is that at decode time you might not know the address that is being loaded/stored, so with the model we have of generating instructions it's not clear whether one or two instructions should be created.


- Ali


Erik Tomusk
2013-06-11 16:49:29 UTC
Permalink
Post by Ali Saidi
I agree too. I think the reason the code is this way is that unaligned support was hacked in. My impression is that most of the reason for the problem is that at decode time you might not know the address that is being loaded/stored, so with the model we have of generating instructions it's not clear whether one or two instructions should be created.
The problem is that there isn't an elegant way of enforcing a cache ports limit here without refactoring the code that handles split accesses.

Is there consensus on whether split accesses are currently handled correctly or not? If not, is there a consensus on how they should be handled? If so, should those changes also be part of this patch?


- Erik


Ali Saidi
2013-07-23 03:38:04 UTC
Permalink
Post by Erik Tomusk
The problem is that there isn't an elegant way of enforcing a cache ports limit here without refactoring the code that handles split accesses.
Is there consensus on whether split accesses are currently handled correctly or not? If not, is there a consensus on how they should be handled? If so, should those changes also be part of this patch?
Sorry for the long delay, this one seemed to slip through...

I don't know what the consensus is or how these sorts of split accesses should be handled. I guess it's no worse than the previous code, so it doesn't need to be part of this patch.

My biggest concern is that I can't really get my head around why the CPU needs to squash a bunch of instructions when one of the cache ports blocks. Did you ever look at Steve's comment above? Is it the case that the squashing only happens in the second case?


Sorry again about forgetting about this patch for so long and thanks for posting it,

Ali


- Ali


Erik Tomusk
2013-11-26 14:14:38 UTC
Permalink
Post by Ali Saidi
Sorry for the long delay, this one seemed to slip through...
I don't know what the consensus is or how these sorts of split accesses should be handled. I guess it's no worse than the previous code, so it doesn't need to be part of this patch.
My biggest concern is that I can't really get my head around why the CPU needs to squash a bunch of instructions when one of the cache ports blocks. Did you ever look at Steve's comment above? Is it the case that the squashing only happens in the second case?
Sorry again about forgetting about this patch for so long and thanks for posting it,
Ali
Sorry for the even longer delay. Things just kept coming up...
Given this squashing behavior, I'm starting to doubt whether it's even possible to implement some sort of cache ports limit. Any data-intensive application would end up with a very low IPC from all the squashing. There must be a way this is done in the real world. Does anyone know what that is?
-Erik
[empty comment--bad email settings made last comment bounce back from gem5-dev]


- Erik


Andreas Hansson
2013-06-06 17:10:55 UTC
Permalink
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://reviews.gem5.org/r/1872/#review4397
-----------------------------------------------------------


Just a random thought. This patch is a good step in the right direction, but why don't we simply use a vector master port for the D side and cycle through them round robin (or pick whichever is free)?
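
Something along those lines could be fairly small (a sketch, assuming the
D side becomes a vector of master ports; lastUsed and retryPending are
hypothetical bookkeeping, not existing members):

    // Pick the next free dcache port round-robin, skipping any port
    // that is still waiting for a retry from the cache.
    MasterPort *
    pickDcachePort(std::vector<MasterPort *> &ports)
    {
        for (size_t i = 0; i < ports.size(); ++i) {
            size_t idx = (lastUsed + 1 + i) % ports.size();
            if (!retryPending[idx]) {
                lastUsed = idx;
                return ports[idx];
            }
        }
        return NULL;  // every port is busy this cycle
    }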


src/cpu/o3/O3CPU.py
<http://reviews.gem5.org/r/1872/#comment4123>

Data ports?



src/cpu/o3/lsq_unit.hh
<http://reviews.gem5.org/r/1872/#comment4124>

const unsigned int?



src/cpu/o3/lsq_unit.hh
<http://reviews.gem5.org/r/1872/#comment4125>

== rather than >=?



src/cpu/o3/lsq_unit.hh
<http://reviews.gem5.org/r/1872/#comment4126>

When is this ever decremented?

How do we link a decrement of this with getting things going again (or do we simply keep on trying and fail until the number is reduced)?



src/cpu/o3/lsq_unit_impl.hh
<http://reviews.gem5.org/r/1872/#comment4127>

\n at the end?



src/cpu/o3/lsq_unit_impl.hh
<http://reviews.gem5.org/r/1872/#comment4122>

Is the port not also "used" on a blocked request that is waiting for a retry?


- Andreas Hansson
Erik Tomusk
2013-06-11 16:48:50 UTC
Permalink
Post by Andreas Hansson
Just a random thought. This patch is a good step in the right direction, but why don't we simply use a vector master port for the D side and cycle through them round robin (or pick which ever is free)?
Wouldn't that mean that for every tick we'd have to keep track of not just how many ports have been used, but also which ones? And we might have to include some code next to every dcache access to do a search for unused ports.
Post by Andreas Hansson
src/cpu/o3/O3CPU.py, line 55
<http://reviews.gem5.org/r/1872/diff/2/?file=35776#file35776line55>
Data ports?
I'm not sure either is better. "Cache ports" gets used in the literature, but also means something different in the gem5 context.
Post by Andreas Hansson
src/cpu/o3/lsq_unit.hh, line 839
<http://reviews.gem5.org/r/1872/diff/2/?file=35778#file35778line839>
When is this ever decremented?
How do we link a decrement of this with getting things going again (or do we simply keep on trying and fail until the number is reduced)?
It's reset in tick().
Post by Andreas Hansson
src/cpu/o3/lsq_unit_impl.hh, line 1205
<http://reviews.gem5.org/r/1872/diff/2/?file=35779#file35779line1205>
Is the port not also "used" on a blocked request that is waiting for a retry?
Practically, I don't think it matters--if the cache is blocked in a given cycle, nothing's going in or out regardless of the number of free ports. In terms of real hardware, I think it depends on what caused the cache to block. This patch tries to model the fact that there are only so many physical wires going into each memory cell, and once they've carried some data in a cycle, there's no more time to carry more. If the cache is blocked because of some buffering somewhere, then I guess the answer is no.


- Erik

