Discussion:
[gem5-dev] One failing Ruby regression after memory-system patches
Andreas Hansson
2012-01-03 17:36:12 UTC
Dear all (and Brad in particular),

With the memory-system patch http://reviews.m5sim.org/r/949/ applied, all regressions pass except one: build/ALPHA_SE_MOESI_hammer/tests/opt/quick/00.hello/alpha/linux/simple-timing-ruby-MOESI_hammer

simerr contains:
warn: Sockets disabled, not accepting gdb connections
fatal: Ruby functional write failed for address 0x89580
@ cycle 136455
[recvFunctional:build/ALPHA_SE_MOESI_hammer/mem/ruby/system/RubyPort.cc, line 449]
Memory Usage: 242204 Kbytes

Changing the fatal to a warn allows the regression to succeed, with simerr containing:
warn: Sockets disabled, not accepting gdb connections
warn: Ruby functional write failed for address 0x89580
hack: be nice to actually delete the event here

It seems very strange that all other regressions succeed and this one does not. Could it be a bug in the Ruby code? I strongly doubt it is related to patch 949, but I do not know Ruby well enough to say for sure.

Ideas and suggestions are welcome.

Thanks.

Andreas


-- IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.
Nilay Vaish
2012-01-03 17:48:53 UTC
Can you provide the trace obtained with the debug flag Ruby? If that is too
long, maybe with RubyPort only.

--
Nilay
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Beckmann, Brad
2012-01-06 20:18:42 UTC
Hi Andreas,

(moving back to gem5-dev since I suspect others will be interested)

I've dug myself out of my email hole and I think I can help out here. I read through your trace and I know what is going on. As Nilay already mentioned, we know that functional accesses, especially functional writes, will not succeed if they race with timing requests. Even though the hello world test uses a single in-order SimpleTiming CPU, a timing request is racing with the functional write. Specifically, the writeback of block 0x89580, with the directory waiting for the data to be written to DRAM, is racing with the fstat syscall's functional write to the same block. I know it is a little hard to figure all that out from staring at the current trace with all Ruby flags turned on. In the future, I would recommend turning on just the ProtocolTrace flag. It will be much easier to understand what is going on.

Though I think I understand the problem, I'm not quite sure how to fix it. When Nilay added functional access support to Ruby, Nilay and I were hoping this situation would not occur. However, since this is just the simple 1-cpu hello world test, I think it is pretty obvious that we are going to have to deal with this situation somehow. We could just deal with these situations one-by-one by modifying the AccessPermissions for particular states. Specifically, here we can solve this problem by setting the permission of Dir:WB_E_W to Read_Write. However, is that how we want to try to solve all these issues? There are certain races that we simply can't get around by better specifying AccessPermissions.

Nilay, what do you think?

Brad
-----Original Message-----
Sent: Friday, January 06, 2012 7:49 AM
To: Nilay Vaish
Cc: Beckmann, Brad; Ali Saidi
Subject: RE: [gem5-dev] One failing Ruby regression after memory-system
patches
Hi Nilay,
I have done some more digging and these are the final events before the
failure:
1. SyscallDesc::doSyscall (this=0x1240f08, callnum=91,
process=0x1c8aa80, tc=0x1c932a0)
2. this in turn calls SyscallReturn fstatFunc<AlphaLinux>(SyscallDesc*, int,
LiveProcess*, ThreadContext*) ()
3. this in turn calls writeBlob on the SETranslatingPortProxy
SETranslatingPortProxy::writeBlob (
this=<value optimized out>, addr=4831384928, p=<value optimized out>,
size=<value optimized out>)
at build/ALPHA_SE_MOESI_hammer/mem/se_translating_port_proxy.cc:125
This is done to address 0x89560 and the blobHelper chops it up in two pieces.
136455: system_port: system_port blobHelper to address 0x89560
136455: system.sys_port_proxy-port0: Functional access caught for address 0x89560
136455: system.sys_port_proxy-port0: Request found in 0 - 0x7ffffff range
136455: system.sys_port_proxy-port0: Functional Write request for [0x89560, line 0x89540]
136455: system.sys_port_proxy-port0: num_busy = 0, num_ro = 0, num_rw = 1
136455: system.sys_port_proxy-port0: [ 0xb0 0xab 0x3 0x20 0x1 0x0 0x0 0x0
0x40 0x45 0x8 0x20 0x1 0x0 0x0 0x0 0x0 0x20 0x0 0x0 0x0 0x0 0x0 0x0 0xd 0x0
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 ]
136455: system.sys_port_proxy-port0: [ 0xb0 0xab 0x3 0x20 0x1 0x0 0x0 0x0
0x40 0x45 0x8 0x20 0x1 0x0 0x0 0x0 0x0 0x20 0x0 0x0 0x0 0x0 0x0 0x0 0xd 0x0
0x0 0x0 0x0 0x0 0x0 0x0 0xa 0x0 0x0 0x0 0xe 0x9d 0xc1 0x2a 0xe8 0x21 0x0 0x0
0x1 0x0 0x0 0x0 0x17 0x87 0x0 0x0 0xbb 0x2 0x0 0x0 0xd 0x88 0x0 0x0 0x0 0x0
0x0 0x0 ]
136455: system.sys_port_proxy-port0: [ 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 ]
136455: system.sys_port_proxy-port0: [ 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
0x0 0x0 0x0 0x0 0xa 0x0 0x0 0x0 0xe 0x9d 0xc1 0x2a 0xe8 0x21 0x0 0x0 0x1 0x0
0x0 0x0 0x17 0x87 0x0 0x0 0xbb 0x2 0x0 0x0 0xd 0x88 0x0 0x0 0x0 0x0 0x0 0x0 ]
136455: system.physmem: Write of size 32 on address 0x89560
136455: system.physmem: 00000000 0a 00 00 00 0e 9d c1 2a e8 21 00 00 01 00 00 00 A*h!
136455: system.physmem: 00000010 17 87 00 00 bb 02 00 00 0d 88 00 00 00 00 00 00 ;
136455: system.sys_port_proxy-port0: Functional access successful!
136455: system_port: system_port blobHelper to address 0x89580
136455: system.sys_port_proxy-port0: Functional access caught for address 0x89580
136455: system.sys_port_proxy-port0: Request found in 0 - 0x7ffffff range
136455: system.sys_port_proxy-port0: Functional Write request for [0x89580, line 0x89580]
136455: system.sys_port_proxy-port0: num_busy = 1, num_ro = 0, num_rw = 0
This is where the panic kicks in and kills the simulation. I am still puzzled how
all other regressions work and this one fails. Any ideas what could be going
wrong?
Andreas
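The "chops it up in two pieces" step in the trace above can be illustrated with a small Python sketch. It is not the gem5 blobHelper code: the function names are made up for illustration, the 64-byte line size is inferred from the trace (0x89560 maps to line 0x89540), and the 96-byte total size is hypothetical because the real size was optimized out in the backtrace.

```python
LINE_SIZE = 64  # bytes; assumed from the trace (0x89560 -> line 0x89540)


def line_address(addr, line_size=LINE_SIZE):
    """Return the address of the cache line that contains addr."""
    return addr & ~(line_size - 1)


def split_at_line_boundaries(addr, size, line_size=LINE_SIZE):
    """Split [addr, addr + size) into chunks that never cross a cache line."""
    chunks = []
    end = addr + size
    while addr < end:
        next_boundary = line_address(addr, line_size) + line_size
        chunk_end = min(next_boundary, end)
        chunks.append((addr, chunk_end - addr))
        addr = chunk_end
    return chunks


# Hypothetical 96-byte blob starting at the address seen in the trace.
for start, length in split_at_line_boundaries(0x89560, 96):
    print(hex(start), length, "line", hex(line_address(start)))
# 0x89560 32 line 0x89540
# 0x89580 64 line 0x89580
```

The first chunk matches the "Write of size 32 on address 0x89560" line in the trace; the second chunk lands on line 0x89580, which is the one that fails.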
-----Original Message-----
Sent: 04 January 2012 10:21
To: Andreas Hansson
Cc: Beckmann, Brad; Ali Saidi
Subject: RE: [gem5-dev] One failing Ruby regression after memory-system
patches
Hi Nilay,
Thanks for the swift response.
I would think the functional access is being made either to: 1) load a
binary, or 2) "fake" an access from some thread/process. Patch 949
essentially forces all functional accesses to go through a real
structural port, so the path through the interconnect may now be
different (and it could have been bypassed altogether in the past as
some functional ports connected straight to memory and ignored any
data buffered in the interconnect). There should be no timing changes
due to the patch as it only affects untimed functional accesses.
In case of Ruby, when a functional access is received at the Ruby Port, all the
controllers are checked for whether or not they have the cache line for this
address and in what state. When this particular functional access is made,
one controller is already trying to access (in timing
mode) the cache line, but the data is still somewhere in the interconnect.
Given your explanation, are you trying to imply that earlier this particular
functional access was not going through Ruby?
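The per-controller check described above can be sketched as a simple tally. This is an illustrative Python model, not the RubyPort.cc code: the enum, the function name, and the example permission lists are assumptions; only the three counter names and their values come from the trace quoted in this thread.

```python
from collections import Counter
from enum import Enum


class AccessPermission(Enum):
    # Member names are illustrative; the thread itself only mentions
    # Busy and Read_Write, and the trace counts read-only copies.
    READ_ONLY = "Read_Only"
    READ_WRITE = "Read_Write"
    BUSY = "Busy"
    NOT_PRESENT = "NotPresent"  # hypothetical: controller does not hold the line


def tally_permissions(controller_perms):
    """Count how many controllers report each permission for one cache line."""
    counts = Counter(controller_perms)
    return {
        "num_busy": counts[AccessPermission.BUSY],
        "num_ro": counts[AccessPermission.READ_ONLY],
        "num_rw": counts[AccessPermission.READ_WRITE],
    }


# Hypothetical controller answers that reproduce the two trace lines.
line_0x89540 = [AccessPermission.READ_WRITE, AccessPermission.NOT_PRESENT]
line_0x89580 = [AccessPermission.NOT_PRESENT, AccessPermission.BUSY]

print(tally_permissions(line_0x89540))  # num_busy 0, num_ro 0, num_rw 1
print(tally_permissions(line_0x89580))  # num_busy 1, num_ro 0, num_rw 0
```

Which controller reports which permission is a guess here; the trace only shows the totals.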
Would you think changing the panic to a warn is the way to go?
Well, as you said this access might be needed for loading a binary, would not
an error in loading the binary result in something bad happening sooner or
later? Unless that functional access is retried at a different instant of time,
this is a situation for panicking.
--
Nilay
Andreas
-----Original Message-----
Sent: 03 January 2012 19:05
To: Andreas Hansson
Cc: Beckmann, Brad; Ali Saidi
Subject: RE: [gem5-dev] One failing Ruby regression after
memory-system patches
Andreas, reading the trace, it does not seem that anything is going wrong.
We know that functional accesses in Ruby can fail and in this case it
fails because the data to be accessed functionally is somewhere in
the interconnection network.
But this is a regression test which has been in existence before the
functional access support was added to Ruby. I have no idea as to why
a functional access is being made. Also patch 949 does not provide any
clues. To me it seems like, it would not affect Ruby at all.
Can you track the source of these functional requests? Also, do your
patches change the way things are timed currently?
--
Nilay
The output with Ruby as the only flag is attached. Sorry for the
large file (still 2 MB but a good reduction from the 76 MB unpacked).
I have kept this off the mailing list intentionally.
Let me know if I can provide any further information.
Thanks!
Andreas
Nilay Vaish
2012-01-06 22:13:32 UTC
I think we should try to understand why this problem is occurring in the
first place. Andreas, in one of the earlier emails, mentioned that these
memory-system patches do not introduce any timing changes. The only other
reason I can think of for this test failing is that these accesses did
not use to go through Ruby earlier. This seems strange, but maybe that
is true.

Andreas, is it possible for you to figure out what particular change in
the memory system is making this test fail?

Whether or not that particular state can have Read_Write permissions
depends on the protocol itself. A quick glance tells me that it might be
all right to change the permissions in this case. We might want to switch
to a single copy of each cache block in order to avoid this problem. Do we
really need the data to reside in the interconnection network to carry out
a simulation? Can we not have fake data in the network and have the actual
data always reside in one single place?

Nilay
Beckmann, Brad
2012-01-06 23:58:18 UTC
Post by Nilay Vaish
I think we should try to understand as to why this problem is occurring in first
place. Andreas, in one of the earlier emails, mentioned that these
memory-system patches do not introduce any timing changes. The only other reason I
can think of why this test is failing, is that these accesses did not used to go
through Ruby earlier. This seems strange, but may be that is true.
The problem occurs because of a race between timing requests and functional requests that come from an emulated system call that doesn't appear to have been modified in years. I doubt there is anything in Andreas's patches that directly causes this problem. They probably just reorder the requests in a particular way that now causes the rare race to occur with the hammer protocol. Having a functional access race with a timing writeback seems like a very rare situation. I'm not surprised we haven't seen this before.
Post by Nilay Vaish
Andreas, is it possible for you to figure out what particular change in the
memory system is making this test fail?
Whether or not that particular state can have Read_Write permissions
depends on the protocol itself. A quick glance tells me that it might be all
right to change the permissions in this case. We might want to switch to a
single copy of each cache block in order to avoid this problem. Do we really
need the data to reside in the interconnection network to carry out a
simulation? Can we not have fake data in the network and the actual data
always resides at one single place?
I'd rather not remove data from the interconnect. That is certainly not in the spirit of "execute at execute". Having data exist in one single place is what we do today with Ruby's backing copy of physmem. If we have data always reside in one single place, then we might as well remove all of Ruby's functional access support and go back to just sending all functional accesses to physmem.

For the particular problem we're seeing today, data is not stuck in the interconnection network. Rather it is just stuck in the DRAM request queue that simulates the timing of the DRAM interface. The data itself has already been written to DirectoryMemory.

Overall, I'm not happy with any solution that comes to my mind. I don't like having to deal with these problems one-by-one, nor do I want to remove Ruby's functional access support. I also don't want to have to build some sort of complicated mechanism that tries to identify valid data floating in any Ruby buffer (network, DRAM, etc.) because I don't see how one can do that without putting a lot of burden/restriction on the protocol writer.

Brad
Andreas Hansson
2012-01-09 13:58:46 UTC
Is your suggestion to live with the failing regression at the moment? To put it differently: is there something I can/must do to assist with solving this issue or can I keep on going and leave this to a Ruby-expert (read Brad or Nilay) to sort out?

Andreas

Nilay
2012-01-09 15:13:02 UTC
Andreas, in the file src/mem/protocol/MOESI_hammer-dir.sm, set the access
permission for state WB_E_W to Read_Write instead of Busy, the currently
set permission. See if this helps in removing the error.

--
Nilay
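A minimal sketch of why switching WB_E_W from Busy to Read_Write lets the functional write through. It is a Python model, not SLICC or gem5 code. The acceptance rule is an assumption chosen to agree with the two num_busy/num_ro/num_rw lines in the trace (accept when exactly one controller reports Read_Write, or when nothing is busy and at least one read-only copy exists); the actual check in RubyPort.cc may differ. Only the WB_E_W entry and its two permission values come from this thread, and the sketch assumes the directory is the only controller that reports a permission for the line.

```python
# Directory state -> access permission. Only WB_E_W is taken from the thread.
DIR_PERMISSIONS_BEFORE = {"WB_E_W": "Busy"}
DIR_PERMISSIONS_AFTER = {"WB_E_W": "Read_Write"}


def counts_for(dir_state, table):
    """Tally permissions for a line held only by the directory (assumption)."""
    perm = table[dir_state]
    return {
        "num_busy": int(perm == "Busy"),
        "num_ro": int(perm == "Read_Only"),
        "num_rw": int(perm == "Read_Write"),
    }


def functional_write_allowed(num_busy, num_ro, num_rw):
    """Assumed rule: one writable copy, or only stable read-only copies."""
    return num_rw == 1 or (num_busy == 0 and num_ro > 0)


before = counts_for("WB_E_W", DIR_PERMISSIONS_BEFORE)
after = counts_for("WB_E_W", DIR_PERMISSIONS_AFTER)

print(before, functional_write_allowed(**before))  # write rejected
print(after, functional_write_allowed(**after))    # write accepted
```

With the Busy entry the tally matches the failing trace line (num_busy = 1, num_rw = 0); with Read_Write the directory counts as the single writable copy.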
Andreas Hansson
2012-01-09 15:39:57 UTC
Hi Nilay,

Thanks! With the suggested change (Busy->Read_Write) the regression passes without any errors. Are you suggesting I make this modification a part of the existing patch (port proxy introduction) or shall we address this as a separate patch to ensure there are no undesirable side effects?

Andreas

Beckmann, Brad
2012-01-09 19:18:01 UTC
Make it a separate patch please. There are probably 100s of similar races that still exist in Ruby. Some of them can be fixed by simply better defining the access permissions (like the current fix), but others can't be fixed given the current infrastructure. My email last week was lamenting the fact that this fix is very unsatisfying.

So just to throw an idea out here... What do people think about supporting Ruby functional accesses by launching a separate thread? If the eventual plan is to have gem5 multi-threaded with each thread having a separate eventqueue, then one could imagine launching a thread that transforms a functional access into a timing access that uses its own eventqueue, so that the original thread's call stack and eventqueue are unperturbed. I know it sounds a little crazy and who knows when multi-threaded support will actually exist. However, it would provide Ruby functional access support without requiring a bunch of access permission fixes or protocol redesign.

Brad
-----Original Message-----
Sent: Monday, January 09, 2012 7:40 AM
Subject: RE: [gem5-dev] One failing Ruby regression after memory-system patches
Hi Nilay,
Thanks! With the suggested change (Busy->Read_Write) the regression
passes without any errors. Are you suggesting I make this modification a part
of the existing patch (port proxy introduction) or shall we address this as a
separate patch to ensure there are no undesirable side effects?
Andreas
-----Original Message-----
Sent: 09 January 2012 15:13
To: Andreas Hansson
Subject: RE: [gem5-dev] One failing Ruby regression after memory-system patches
Andreas, in the file src/mem/protocol/MOESI_hammer-dir.sm, set the
access permission for state WB_E_W to Read_Write, instead of Busy, the
current set permission. See if this helps in removing the error.
--
Nilay
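
To illustrate why moving a state from Busy to Read_Write can let the functional write through, here is a simplified C++ model. The rule in `functionalWriteAllowed` is an assumption made for illustration (a write is refused while any controller reports Busy for the block); it is not a copy of the check in RubyPort.cc, and the enum is a hypothetical stand-in for the permissions named in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical permission set; Busy and Read_Write are the names used in
// the thread, the others are assumed for completeness.
enum class AccessPermission { Invalid, NotPresent, Busy, Read_Only, Read_Write };

// Assumed rule: allow a functional write only if no controller is in a
// transient (Busy) state for the block and at least one holds a valid copy.
bool functionalWriteAllowed(const std::vector<AccessPermission> &perms) {
    int busy = 0;
    int valid = 0;
    for (AccessPermission p : perms) {
        if (p == AccessPermission::Busy) {
            ++busy;
        } else if (p == AccessPermission::Read_Only ||
                   p == AccessPermission::Read_Write) {
            ++valid;
        }
    }
    return busy == 0 && valid > 0;
}
```

Under this model, a directory sitting in a writeback state marked Busy blocks the write, while the same state marked Read_Write lets it proceed. Whether that is safe depends on whether the directory's copy is really up to date in that state.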
Post by Andreas Hansson
Is your suggestion to live with the failing regression at the moment?
To put it differently: is there something I can/must do to assist with
solving this issue or can I keep on going and leave this to a
Ruby-expert (read Brad or Nilay) to sort out?
Andreas
-----Original Message-----
Sent: 06 January 2012 23:58
To: Nilay Vaish
Cc: Andreas Hansson; Ali Saidi; gem5 Developer List
Subject: RE: [gem5-dev] One failing Ruby regression after
memory-system patches
Post by Nilay Vaish
I think we should try to understand why this problem is
occurring in the first place. Andreas, in one of the earlier emails,
mentioned that these memory-system patches do not introduce any
timing changes. The only other reason I can think of for this test
failing is that these accesses did not go through Ruby earlier.
This seems strange, but maybe that is true.
The problem occurs because of a race between timing requests and
functional requests that come from an emulated system call that doesn't
appear to have been modified in years. I doubt there is anything in
Andreas's patches that directly causes this problem. They probably
just reorder the requests in a particular way that now causes the rare
race to occur with the hammer protocol. Having a functional access
race with a timing writeback seems like a very rare situation. I'm
not surprised we haven't seen this before.
Post by Nilay Vaish
Andreas, is it possible for you to figure out what particular change
in the memory system is making this test fail?
Whether or not that particular state can have Read_Write permissions
depends on the protocol itself. A quick glance tells me that it might
be all right to change the permissions in this case. We might want to
switch to a single copy of each cache block in order to avoid this
problem. Do we really need the data to reside in the interconnection
network to carry out a simulation? Can we not have fake data in the
network and the actual data always resides at one single place?
I'd rather not remove data from the interconnect. That is certainly
not in the spirit of "execute at execute". Having data exist in one
single place is what we do today with Ruby's backing copy of physmem.
If we have data always reside in one single place, then we might as
well remove all of Ruby's functional access support and go back to
just sending all functional accesses to physmem.
For the particular problem we're seeing today, data is not stuck in
the interconnection network. Rather it is just stuck in the DRAM
request queue that simulates the timing of the DRAM interface. The
data itself has already been written to DirectoryMemory.
Overall, I'm not happy with any solution that comes to my mind. I
don't like having to deal with these problems one-by-one, nor do I
want to remove Ruby's functional access support. I also don't want to
have to build some sort of complicated mechanism that tries to
identify valid data floating in any Ruby buffer (network, DRAM, etc.)
because I don't see how one can do that without putting a lot of
burden/restriction on the protocol writer.
Brad
Gabriel Michael Black
2012-01-09 20:53:53 UTC
Permalink
I think a new thread and a new event queue are independent. I don't
like how we're already adding something to run time forward and then
throw things away and roll back time. Time should be monotonically
increasing.

Gabe
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
nathan binkert
2012-01-09 21:08:26 UTC
Permalink
I agree. Seems pretty dangerous to me.

Nate

Steve Reinhardt
2012-01-09 21:35:39 UTC
Permalink
If we want to do something involving threads, what we should do is
implement a "user-level thread" capability that lets you take a
single-threaded piece of code, suspend it while you go run some other
events, and then have a later event resume that "thread". Gabe knows what
I'm talking about ;-).

This doesn't fix functional accesses directly, but it removes a lot of the
motivation for them, which is to avoid making things like syscall emulation
functions ridiculously complex. With this capability, you could switch all
the syscall emulation functions over to use timing accesses.
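
A rough C++ sketch of such a user-level thread, assuming a POSIX system that provides `ucontext.h`. The `UserThread` class and the `readStringTimed` helper are hypothetical names invented for this example, not gem5 code. The point is that the syscall-style helper stays straight-line code even though it yields before every memory access.

```cpp
#include <ucontext.h>

#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical user-level thread built on POSIX ucontext.
class UserThread {
  public:
    explicit UserThread(std::function<void(UserThread &)> body)
        : body_(std::move(body)), stack_(64 * 1024) {
        getcontext(&ctx_);
        ctx_.uc_stack.ss_sp = stack_.data();
        ctx_.uc_stack.ss_size = stack_.size();
        ctx_.uc_link = &caller_;  // where to continue once the body returns
        makecontext(&ctx_, &UserThread::trampoline, 0);
    }
    UserThread(const UserThread &) = delete;
    UserThread &operator=(const UserThread &) = delete;

    // Run the body until it suspends or finishes.
    void resume() {
        if (finished_) return;
        current_ = this;
        swapcontext(&caller_, &ctx_);
    }

    // Called from inside the body: hand control back to the resumer.
    void suspend() { swapcontext(&ctx_, &caller_); }

    bool finished() const { return finished_; }

  private:
    // Entry point on the private stack; runs the body exactly once.
    static void trampoline() {
        UserThread *self = current_;
        self->body_(*self);
        self->finished_ = true;
    }

    static UserThread *current_;
    std::function<void(UserThread &)> body_;
    std::vector<char> stack_;
    ucontext_t ctx_;
    ucontext_t caller_;
    bool finished_ = false;
};

UserThread *UserThread::current_ = nullptr;

// Assumed syscall-style helper: each suspend() stands in for waiting on a
// timing access, yet the code reads as a plain sequential loop.
std::string readStringTimed(UserThread &self, const std::string &memory) {
    std::string out;
    for (char c : memory) {
        self.suspend();  // a later event would resume us when data arrives
        out.push_back(c);
    }
    return out;
}
```

In a simulator, the calls to `resume()` would come from the completion events of the timing accesses rather than from a plain loop.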

Steve
Gabriel Michael Black
2012-01-10 01:13:14 UTC
Permalink
This is mostly true and I wouldn't be opposed to (re)implementing code
like that, but there are other uses of functional access like
verification by the checker, binary loading, pseudo instructions,
remote gdb, address translation done by the simulator itself, etc.
While that would make it easier to get away without functional
accesses in some cases, I expect some others would still be quite
awkward or maybe impossible, especially if they're not supposed to
perturb a running simulation.

Gabe
Andreas Hansson
2012-01-10 10:04:27 UTC
Permalink
I am also not very fond of the idea of using timing accesses (with side effects) to mimic the functional accesses. Ultimately the functional access should be simple and fast. Would it help if every buffering component in the system was registered in a central location (e.g. system) and thus allowed some form of iteration without traversing the system structure?
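
One way the central registration could look, as a minimal C++ sketch. All names (`FunctionalBuffer`, `BufferRegistry`, `PendingWriteQueue`) are hypothetical and chosen for illustration; nothing here is an existing gem5 interface. Each buffering component registers once, and a functional write then visits every registered buffer without walking the system structure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Addr = std::uint64_t;

// Hypothetical interface for any component that can hold in-flight data.
class FunctionalBuffer {
  public:
    virtual ~FunctionalBuffer() = default;
    // Overwrite every copy of addr held here; return how many were updated.
    virtual int functionalWrite(Addr addr, std::uint8_t value) = 0;
};

// Hypothetical central registry, e.g. owned by a system-level object.
class BufferRegistry {
  public:
    void registerBuffer(FunctionalBuffer *buffer) { buffers.push_back(buffer); }

    // Visit every registered buffer; return the total number of copies updated.
    int functionalWrite(Addr addr, std::uint8_t value) {
        int updated = 0;
        for (FunctionalBuffer *buffer : buffers) {
            updated += buffer->functionalWrite(addr, value);
        }
        return updated;
    }

  private:
    std::vector<FunctionalBuffer *> buffers;
};

// Example component: a queue of pending writes, loosely modelled on the
// DRAM request queue mentioned earlier in the thread.
class PendingWriteQueue : public FunctionalBuffer {
  public:
    void enqueue(Addr addr, std::uint8_t value) {
        entries.push_back(Entry{addr, value});
    }

    int functionalWrite(Addr addr, std::uint8_t value) override {
        int updated = 0;
        for (Entry &entry : entries) {
            if (entry.addr == addr) {
                entry.value = value;
                ++updated;
            }
        }
        return updated;
    }

    std::uint8_t valueAt(std::size_t index) const { return entries[index].value; }

  private:
    struct Entry {
        Addr addr;
        std::uint8_t value;
    };
    std::vector<Entry> entries;
};
```

The burden Brad mentions does not disappear: each component still has to know which of its entries hold valid data, but the iteration itself becomes simple and has no timing side effects.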

Andreas


-----Original Message-----
From: gem5-dev-bounces-1Gs4CP2/***@public.gmane.org [mailto:gem5-dev-bounces-1Gs4CP2/***@public.gmane.org] On Behalf Of Gabriel Michael Black
Sent: 09 January 2012 20:54
To: gem5-dev-1Gs4CP2/***@public.gmane.org
Subject: Re: [gem5-dev] One failing Ruby regression after memory-system patches

I think a new thread and a new event queue are independent. I don't
like how we're already adding something to run time forward and then
throw things away and roll back time. Time should be monotonically
increasing.

Gabe
Post by Beckmann, Brad
Make it a separate patch please. There are probably 100s of similar
races that still exist in Ruby. Some of them can be fixed by simply
better defining the access permissions (like the current fix), but
others can't be fixed given the current infrastructure. My email
last week was lamenting the fact that this fix is very unsatisfying.
So just to throw an idea out here... What do people think about
supporting Ruby functional accesses by launching a separate thread?
If the eventual plan is to have gem5 multi-threaded with each thread
having a separate eventqueue, then one could imagine launching a
thread that transforms a functional access into a timing access that
utilizes its own eventqueue, so the original thread's call stack
and eventqueue are unperturbed. I know it sounds a little crazy and
who knows when multi-threaded support will actually exist. However,
it would provide Ruby functional access support without requiring a
bunch of access permission fixes or protocol redesign.
Brad
-----Original Message-----
Sent: Monday, January 09, 2012 7:40 AM
Subject: RE: [gem5-dev] One failing Ruby regression after memory-system patches
Hi Nilay,
Thanks! With the suggested change (Busy->Read_Write) the regression
passes without any errors. Are you suggesting I make this modification
a part of the existing patch (port proxy introduction), or shall we
address this as a separate patch to ensure there are no undesirable
side effects?
Andreas
-----Original Message-----
Sent: 09 January 2012 15:13
To: Andreas Hansson
Subject: RE: [gem5-dev] One failing Ruby regression after memory-system patches
Andreas, in the file src/mem/protocol/MOESI_hammer-dir.sm, set the
access permission for state WB_E_W to Read_Write instead of Busy, the
currently set permission. See if this helps in removing the error.
--
Nilay
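
As a rough illustration of why the permission matters, the following is a simplified model and not the actual RubyPort code. The helper name functionalWriteAllowed and the exact rule it applies are assumptions made for this sketch: a functional write is refused if any controller reports Busy for the line, and needs at least one controller holding a stable copy.

```cpp
#include <vector>

// Simplified stand-in for the access permissions a controller can report.
enum class AccessPermission
{
    Invalid, NotPresent, Read_Only, Read_Write, Busy
};

// Hypothetical helper: decide whether a functional write may proceed,
// given the permission every controller reports for the target line.
bool functionalWriteAllowed(const std::vector<AccessPermission> &perms)
{
    int numBusy = 0;
    int numStable = 0;  // Read_Only or Read_Write copies

    for (AccessPermission p : perms) {
        if (p == AccessPermission::Busy)
            ++numBusy;
        else if (p == AccessPermission::Read_Only ||
                 p == AccessPermission::Read_Write)
            ++numStable;
    }

    // A Busy (transient) copy means valid data may be in flight somewhere
    // the functional access cannot reach, so the write is refused.
    return numBusy == 0 && numStable > 0;
}
```

Under this model, a directory sitting in WB_E_W with permission Busy blocks the write, while the same state marked Read_Write lets it through, which matches the behaviour described in this thread.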
Post by Andreas Hansson
Is your suggestion to live with the failing regression at the moment?
To put it differently: is there something I can/must do to assist with
solving this issue or can I keep on going and leave this to a
Ruby-expert (read Brad or Nilay) to sort out?
Andreas
-----Original Message-----
Sent: 06 January 2012 23:58
To: Nilay Vaish
Cc: Andreas Hansson; Ali Saidi; gem5 Developer List
Subject: RE: [gem5-dev] One failing Ruby regression after memory-system patches
Post by Nilay Vaish
I think we should try to understand why this problem is occurring in
the first place. Andreas, in one of the earlier emails, mentioned that
these memory-system patches do not introduce any timing changes. The
only other reason I can think of for this test failing is that these
accesses did not go through Ruby before. This seems strange, but maybe
that is true.
The problem occurs because of a race between timing requests and
functional requests that come from an emulated system call that doesn't
appear to have been modified in years. I doubt there is anything in
Andreas's patches that directly causes this problem. They probably
just reorder the requests in a particular way that now causes the rare
race to occur with the hammer protocol. Having a functional access
race with a timing writeback seems like a very rare situation. I'm
not surprised we haven't seen this before.
Post by Nilay Vaish
Andreas, is it possible for you to figure out what particular change
in the memory system is making this test fail?
Whether or not that particular state can have Read_Write permissions
depends on the protocol itself. A quick glance tells me that it might
be all right to change the permissions in this case. We might want to
switch to a single copy of each cache block in order to avoid this
problem. Do we really need the data to reside in the interconnection
network to carry out a simulation? Can we not have fake data in the
network and the actual data always resides at one single place?
I'd rather not remove data from the interconnect. That is certainly
not in the spirit of "execute at execute". Having data exist in one
single place is what we do today with Ruby's backing copy of physmem.
If we have data always reside in one single place, then we might as
well remove all of Ruby's functional access support and go back to
just sending all functional accesses to physmem.
For the particular problem we're seeing today, data is not stuck in
the interconnection network. Rather it is just stuck in the DRAM
request queue that simulates the timing of the DRAM interface. The
data itself has already been written to DirectoryMemory.
Overall, I'm not happy with any solution that comes to my mind. I
don't like having to deal with these problems one-by-one, nor do I
want to remove Ruby's functional access support. I also don't want to
have to build some sort of complicated mechanism that tries to
identify valid data floating in any Ruby buffer (network, DRAM, etc.)
because I don't see how one can do that without putting a lot of
burden/restriction on the protocol writer.
Brad

