Discussion:
[gem5-dev] Ruby serialize removing event queue head
Timothy M Jones
2015-06-11 19:48:12 UTC
Hello,

Could someone tell me why we need to take the head event off the event
queue in RubySystem::serialize() in src/mem/ruby/system/System.cc?

Event* eventq_head = eventq->replaceHead(NULL);

The problem I'm getting is that when simulate() is called a few lines
later, it tries to reschedule the simulate_limit_event, but that causes
a panic because it's no longer on the event queue. This is happening
when trying to take a checkpoint with ruby. I can't work out from the
comments why the head event needs to be taken off in the first place.
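
The failure mode described above can be reproduced in a toy model. This is only a sketch in Python, not gem5 code: ToyEventQueue, replace_head, reschedule and the event name "limit_event" are hypothetical stand-ins, and the one behaviour borrowed from the message is that rescheduling an event which is no longer on the queue is treated as a fatal error.

```python
class ToyEventQueue:
    """Hypothetical stand-in for an event queue; not gem5's EventQueue."""

    def __init__(self):
        self.events = {}  # event name -> tick at which it fires

    def schedule(self, name, tick):
        self.events[name] = tick

    def replace_head(self, new_events):
        """Detach what is currently queued and hand it back to the caller."""
        old = self.events
        self.events = dict(new_events or {})
        return old

    def reschedule(self, name, tick):
        # Assumption: only an event that is currently queued can be moved.
        if name not in self.events:
            raise RuntimeError(f"reschedule of unscheduled event {name!r}")
        self.events[name] = tick


def checkpoint_with_detached_head(queue):
    """Follow the order of operations in the message: detach, then simulate."""
    saved = queue.replace_head(None)           # like replaceHead(NULL)
    try:
        queue.reschedule("limit_event", 500)   # what simulate() attempts
        return "ok"
    except RuntimeError as err:
        return f"panic: {err}"
    finally:
        queue.replace_head(saved)              # put the original events back
```

In this model the reschedule fails precisely because the limit event was set aside with the rest of the queue before simulate() ran.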

This is basically the reason behind the problems in this thread:

https://www.mail-archive.com/gem5-***@gem5.org/msg11701.html

Thanks
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
Timothy M Jones
2015-06-13 08:18:31 UTC
Hi again,

Further to this message, I've used hg bisect to find the revision that
breaks checkpointing with ruby. The first bad changeset is revision
10524, which Nilay committed in November. It fails with the panic() on
the missing event that I wrote about previously.
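
For readers unfamiliar with bisection, the search it performs can be sketched as a binary search. This is a simplified Python illustration, not Mercurial's implementation; first_bad and is_bad are hypothetical names, and the sketch assumes a linear history with a single good-to-bad transition.

```python
def first_bad(revisions, is_bad):
    """Binary search for the first revision where is_bad() returns True.

    Assumes revisions[0] is good, revisions[-1] is bad, and that there
    is exactly one good-to-bad transition in between.
    """
    lo, hi = 0, len(revisions) - 1   # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid                 # the transition is at or before mid
        else:
            lo = mid                 # the transition is after mid
    return revisions[hi]
```

If the history contains more than one independent breakage, the single-transition assumption no longer holds and the answer depends on which failure the test detects.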

I've scanned through the diff and can't immediately see any reason why
this would break serialisation, although it does remove some of the code
to serialise ruby state.

Could anyone (Nilay?) give me a hint as to why this might break
checkpointing with ruby?

I've compiled with the MOESI_hammer protocol for x86, then run with this
command line:

./build/X86/gem5.opt --remote-gdb-port=0 -d <outdir>
configs/example/fs.py -n 1 --kernel <my-kernel> --script
configs/boot/hack_back_ckpt.rcS --max-checkpoints 1 --checkpoint-dir
<cptdir> --disk-image <my-disk-image> --cpu-type timing --restore-with
timing --ruby

Any help would be appreciated. I don't know ruby at all, so trying to
work out what's going on is slow....

Cheers
Tim
Nilay Vaish
2015-06-13 14:48:18 UTC
Your bisection is not right. You might want to take a look at the
following changeset:


changeset: 10756:f9c0692f73ec
user: Curtis Dunham <***@arm.com>
date: Mon Mar 23 06:57:36 2015 -0400
summary: sim: Reuse the same limit_event in simulate()


I suggest that you revert this changeset in your repo while I think about
what needs to be done.

--
Nilay
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Joel Hestness
2015-06-13 17:03:21 UTC
Hey guys,
I'm pretty sure Tim is correct that the checkpointing bugs were
introduced earlier than the changeset Nilay points to; gem5-gpu is
currently using gem5 rev 10645 <http://repo.gem5.org/gem5/rev/cd95d4d51659>,
and we cannot get reliable checkpoint and restore with it. Note that Tim's
bug may not be the only checkpointing bug that exists right now.

To answer Tim's question: While taking a checkpoint, Ruby commandeers the
event queue to inject flushing memory accesses into the caches. This is
used to generate a trace of cache contents, which can be used to warm up
the caches on checkpoint restore. To take over control of the event queue,
Ruby clears the event at the queue head (I think this assumes there is only
1 event in the queue? This should probably be checked), and then adds its
own event for the cache flushing operation. After the caches have been
flushed (simulate() call in RubySystem::serialize()), Ruby restores the
head event that was in the queue and rolls back the current tick.
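
The sequence described above (set the queued events aside, schedule a flush event, simulate, then restore the queue and roll back the tick) can be sketched as follows. This is a toy Python model under stated assumptions, not the RubySystem::serialize() code: ToySim, cooldown, the "cache_flush" event name and the flush_latency parameter are all hypothetical.

```python
import heapq


class ToySim:
    """Hypothetical simulator with a tick counter and an event heap."""

    def __init__(self):
        self.tick = 0
        self.queue = []  # heap of (tick, name)

    def schedule(self, name, tick):
        heapq.heappush(self.queue, (tick, name))

    def simulate(self):
        """Run every queued event in tick order, advancing self.tick."""
        fired = []
        while self.queue:
            self.tick, name = heapq.heappop(self.queue)
            fired.append(name)
        return fired


def cooldown(sim, flush_latency):
    """Borrow the queue for a cache flush, then restore queue and tick."""
    saved_queue, saved_tick = sim.queue, sim.tick   # 1. set the real events aside
    sim.queue = []
    sim.schedule("cache_flush", sim.tick + flush_latency)  # 2. the flush event
    fired = sim.simulate()                          # 3. run only the flush
    sim.queue, sim.tick = saved_queue, saved_tick   # 4. restore and roll back
    return fired
```

Anything inside simulate() that expects one of the set-aside events to still be queued would break in step 3, which matches the symptom reported at the start of the thread.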

One way to check if this cooldown operation is at fault for unreliable
checkpointing is to simply comment out the event queue commandeering, and
try to take a checkpoint. You may also be able to test checkpoint restore
by commenting the cache warm-up code in RubySystem::unserialize(). If
checkpoint and restore work without the event queue commandeering, it is
likely that the event queue manipulation is problematic.

  I'd also recommend taking a checkpoint and restoring from it with the
gem5 flag --debug-flags=RubyCacheTrace, which will show what the cache
flushing and warm-up are doing, respectively.

Joel
--
Joel Hestness
PhD Candidate, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
Timothy M Jones
2015-06-17 10:15:56 UTC
Thanks Nilay and Joel for the information.

I've been playing around with this over the past few days and I can't
work out what the point of the flush is. The CacheRecorder already has
a copy of all the data blocks in the trace before the flush starts.
Removing the flush event and subsequent simulation produces exactly the
same system.ruby.cache.gz file as with it in, so I guess it's safe to
remove them....

So, with that out of the way, I can create checkpoints and exit the
simulator correctly. I'm not 100% sure about restoring the checkpoint
though, and it seems a little hacky. Is there a reason it has to
unserialise by inserting memory requests into the event queue - couldn't
it just write the data into the correct locations in the caches?

There's also a question about whether ruby should be recording its state
anyway. Shouldn't it be doing the same as the classic memory system
caches and implementing memWriteback() to flush all dirty data out
before checkpointing happens, then it doesn't need to trace anything?
(Maybe I'm opening a can of worms, but I thought I'd just ask!)

Cheers
Tim
Beckmann, Brad
2015-06-17 20:38:11 UTC
Hi Tim,

I have not been completely following this thread, but I can answer your question about unserializing cache contents.

The benefit of creating a trace, rather than just inserting data into the cache, is two-fold. First, by creating a trace from a very large cache system, one can warm up caches of different sizes, associativities and even completely different cache hierarchies/configurations from a single trace. Second, and probably more important, Ruby protocols rely on timing requests to set cache block state to the unique states used by a particular protocol. Often Ruby is used to compare different protocols, and this process allows us to compare protocols using the exact same checkpoint.
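
The first point can be illustrated with a toy replay. This is a hedged Python sketch, not Ruby's CacheRecorder: ToyLRUCache, warm_up, the fully associative LRU policy and the "LD"/"ST"/"IFETCH" labels are assumptions chosen for the example. It only shows that one recorded trace can fill caches of different sizes through ordinary accesses; it does not model protocol states.

```python
from collections import OrderedDict


class ToyLRUCache:
    """Hypothetical fully associative LRU cache keyed by block address."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.blocks = OrderedDict()  # address -> request type that last touched it

    def access(self, address, kind):
        if address in self.blocks:
            self.blocks.move_to_end(address)      # refresh LRU position
        elif len(self.blocks) >= self.num_blocks:
            self.blocks.popitem(last=False)       # evict least recently used
        self.blocks[address] = kind


def warm_up(cache, trace):
    """Replay (address, kind) records so the cache fills via normal accesses."""
    for address, kind in trace:
        cache.access(address, kind)
    return list(cache.blocks)                     # resident blocks, LRU first
```

Replaying the same five-record trace into an 8-block and a 2-block instance leaves each cache with contents that fit its own capacity, which is not possible if block contents are copied directly into a fixed geometry.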

I hope that helps,

Brad




Timothy M Jones
2015-06-18 07:31:54 UTC
Hi Brad,
Post by Beckmann, Brad
The benefit of creating a trace, rather than just inserting data into the cache, is two-fold. First, by creating a trace from a very large cache system, one can warm up caches of different sizes, associativities and even completely different cache hierarchies/configurations from a single trace. Second, and probably more important, Ruby protocols rely on timing requests to set cache block state to the unique states used by a particular protocol. Often Ruby is used to compare different protocols, and this process allows us to compare protocols using the exact same checkpoint.
Thanks for the explanation. OK, so I understand why you want to have a
trace, but is there any need for it, or could you just start at a
checkpoint with a totally empty cache (as in the classic model)?
Basically, is this trace simply a way to avoid the need to warm up the
caches after a checkpoint?

At the moment I can create the trace at a checkpoint, which is progress,
but I get problems both in the simulator and simulated system when
restoring from the checkpoint. I'd like to know whether to invest the
time in getting this to work, or whether I should simply implement
memWriteback() for ruby to flush dirty data before a checkpoint, then do
away with the trace altogether.

Cheers
Tim
Joel Hestness
2015-06-18 14:35:54 UTC
Hi Tim,
I think we should keep the cache tracing functionality. I've used cache
warm-up after taking repeated checkpoints to find particular system
activity levels, and I only restore+simulate those that meet some criteria
(similar to SimPoints). Often, the intervals simulated between these
checkpoints are fine-grained, so it is important to be able to do cache
cool-down and warm-up in a slim, automatic way. I expect we would rather
not try to figure out another means of warming up caches.

Can you describe the restore problems you're running into? Perhaps we can
help debug.

Joel
Timothy M Jones
2015-06-19 08:22:18 UTC
Hi Joel,
Post by Joel Hestness
I think we should keep the cache tracing functionality. I've used cache
warm-up after taking repeated checkpoints to find particular system
activity levels, and I only restore+simulate those that meet some criteria
(i.e. like simpoints). Often, the intervals simulated between these
checkpoints are fine-grained, so it is important to be able to do cache
cool-down and warm-up in a slim, automatic way. I expect we would rather
not try to figure out another means of warming up caches.
Ah, I see. No worries, I just wanted to be sure that we want to keep
this functionality before I dive into it too much.
Post by Joel Hestness
Can you describe the restore problems you're running into? Perhaps we can
help debug.
The biggest problem is a seg fault within the simulated program, which
happens almost immediately after restoration. An example output is
below. I think my first step will be to verify that the blocks in each
of the caches at the point the checkpoint is taken are the same as after
the checkpoint has been restored.

Cheers
Tim


x264[1016]: segfault at 28 ip 00007ffbcb65dbc4 sp 00007fffe7d5f780 error
4 in ld-2.6.1.so[7ffbcb655000+1b000]
------------[ cut here ]------------
kernel BUG at mm/mmap.c:2274!
invalid opcode: 0000 [#1] SMP
CPU 0
Modules linked in:

Pid: 1016, comm: x264 Not tainted 3.2.24 #1
RIP: 0010:[<ffffffff810c00ae>] [<ffffffff810c00ae>] exit_mmap+0xfe/0x100
RSP: 0000:ffff88001e1cfc08 EFLAGS: 0000022c
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff88001e1cffd8
RDX: 000000000000006a RSI: ffff88001e1a1330 RDI: ffff88001e81e380
RBP: ffff88001e1b1740 R08: 0000000000000000 R09: 0000000000000000
R10: ffff88001e8fb8d0 R11: ffffffff81739cc0 R12: 00007fffe7e00000
R13: 0000000000000000 R14: 000000000000012a R15: ffff88001e930930
FS: 00007ffbcb8666f0(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ffbcb65dbc4 CR3: 00000000016e3000 CR4: 00000000000006b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
Process x264 (pid: 1016, threadinfo ffff88001e1ce000, task ffff88001e930930)
Stack:
00000000000006d5 ffff88001e1b1740 000000011e1cfc88 ffff88001e1cfc28
0000000000000000 0000000800000000 ffffea00006e6a38 ffffea00006e6a00
ffffea00006e69c8 ffffea00006e6990 ffffea00006e7100 ffffea00006e70c8
Call Trace:
[<ffffffff810385d0>] ? mmput+0x30/0xe0
[<ffffffff8103cd07>] ? exit_mm+0xf7/0x120
[<ffffffff8105c2da>] ? hrtimer_try_to_cancel+0x6a/0xb0
[<ffffffff8103e433>] ? do_exit+0x133/0x760
[<ffffffff8103ead8>] ? do_group_exit+0x38/0xa0
[<ffffffff8104c95e>] ? get_signal_to_deliver+0x19e/0x550
[<ffffffff810016e4>] ? do_signal+0x44/0x700
[<ffffffff814aac0d>] ? do_page_fault+0x3fd/0x490
[<ffffffff8110ae3b>] ? fsnotify+0x24b/0x330
[<ffffffff8102b8df>] ? __wake_up+0x2f/0x50
[<ffffffff81223d10>] ? process_echoes+0x20/0x20
[<ffffffff81001e09>] ? do_notify_resume+0x49/0x50
[<ffffffff814a7d36>] ? retint_signal+0x3d/0x77
Code: e8 08 72 ff ff 0f 1f 84 00 00 00 00 00 48 89 df e8 68 d7 ff ff 48
85 c0 48 89 c3 75 f0 48 83 bd e0 00 00 00 00 0f 84 2a ff ff ff <0f> 0b
41 57 41 56 41 55 49 89 fd 41 54 49 89 f4 55 53 48 83 ec
RIP [<ffffffff810c00ae>] exit_mmap+0xfe/0x100
RSP <ffff88001e1cfc08>
---[ end trace d214638988f52ea9 ]---
Fixing recursive fault but reboot is needed!
Joel Hestness
2015-06-19 15:16:25 UTC
Hi Tim,
Post by Timothy M Jones
Post by Joel Hestness
I think we should keep the cache tracing functionality. I've used cache
warm-up after taking repeated checkpoints to find particular system
activity levels, and I only restore+simulate those that meet some criteria
(i.e. like simpoints). Often, the intervals simulated between these
checkpoints are fine-grained, so it is important to be able to do cache
cool-down and warm-up in a slim, automatic way. I expect we would rather
not try to figure out another means of warming up caches.
Ah, I see. No worries, I just wanted to be sure that we want to keep
this functionality before I dive into it too much.
Alright. Thanks.


Post by Timothy M Jones
Post by Joel Hestness
Can you describe the restore problems you're running into? Perhaps we can
help debug.
The biggest problem is a seg fault within the simulated program, which
happens almost immediately after restoration. An example output is below.
I think my first step will be to verify that the blocks in each of the
caches at the point the checkpoint is taken is the same as after the
checkpoint has been restored.
Cheers
Tim
Unsolicited hint: this segfault looks like it could be a bug in x264 or its
libraries (depending on which version of Linux you've booted). ld-2.6.1 is 7
years old, and may only work with older versions of Linux (e.g. 2.6.28.4). If
you're booting a newer Linux kernel, you might be running into
kernel<>library version issues. Have you tried running a simpler
application on checkpoint restore (e.g. hello in
gem5/tests/test-progs/hello/src/)?


Joel
Timothy M Jones
2015-06-19 15:47:07 UTC
Hi Joel,
Post by Joel Hestness
Unsolicited hint: this looks like it could be a bug in x264 or libraries
(depending on which version of Linux you've booted). ld-2.6.1 is 7 years
old, and may only work with older versions of Linux (e.g. 2.6.28.4). If
you're booting a newer Linux kernel, you might be running into
kernel<>library version issues. Have you tried running a simpler
application on checkpoint restore (e.g. hello in
gem5/tests/test-progs/hello/src/)?
I haven't tried that, but I will. However, if I continue executing
beyond the checkpoint then simulation proceeds normally - this seg fault
only appears after a checkpoint restore.

Cheers
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
Timothy M Jones
2015-06-22 19:15:12 UTC
Hi All,
Post by Timothy M Jones
I think my first step will be to verify that the blocks in each of the
caches at the point the checkpoint is taken are the same as after the
checkpoint has been restored.
I have tracked down the problem to a particular line in one of the L1
caches. There's a fairly simple (and obvious) check on each block's
access permissions to determine what request to make in the trace to
restore state:

1) Read-only and instruction cache => Instruction fetch request
2) Read-only otherwise => Load request
3) Read-write permissions => Store request

However, I've got a block with read-only permission that still manages
to contain dirty data! Could anyone tell me how this can happen?
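
The check described above can be sketched roughly as follows. This is a
minimal Python model, not the gem5 C++ code: Permission, Request, and
request_for_block are illustrative names assumed for this example. It
shows that only the access permission and the cache type feed the
decision, so a dirty flag never influences which request is recorded.

```python
from enum import Enum


class Permission(Enum):
    READ_ONLY = "read_only"
    READ_WRITE = "read_write"


class Request(Enum):
    IFETCH = "instruction fetch"
    LOAD = "load"
    STORE = "store"


def request_for_block(permission, is_instruction_cache):
    """Pick the trace request used to re-create a block on restore.

    Hypothetical model of the three rules listed above. The dirty bit
    is deliberately not an input, mirroring the reported behaviour.
    """
    if permission is Permission.READ_WRITE:
        return Request.STORE
    if permission is Permission.READ_ONLY:
        return Request.IFETCH if is_instruction_cache else Request.LOAD
    raise ValueError(f"unhandled permission: {permission!r}")
```

Under this model a read-only block that happens to be dirty is recorded
as a plain load, so nothing in the trace says that its data differs from
memory.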

Cheers
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
Joel Hestness
2015-06-22 19:35:24 UTC
Hi Tim,
I'm not sure whether this is really a bug. Dirty cache lines could be
shared in a read-only state among caches. Whether to allow data in caches
to differ from memory (i.e. dirty bit set) is a choice made by a protocol's
directory. It could write a line back to memory to clean it, or it could
allow shared (read-only) copies that are identical across caches but dirty
with respect to memory. I'm not sure whether the MOESI_hammer directory
allows this, but it is a possibility.

Are you still testing this with MOESI_hammer? Also, I'm not sure what
code you're referring to that checks the access permissions and dirty bits
(during cache warm-up?). Can you point us to that code?

Thanks!
Joel
--
Joel Hestness
PhD Candidate, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
Timothy M Jones
2015-06-22 20:08:59 UTC
Hi Joel,
Post by Joel Hestness
I'm not sure whether this is really a bug.
No, I'm sure that this isn't the bug. The problem is that when the line
is restored it isn't restored to the same state. My guess so far is that
it isn't consistent with memory, and because this dirty bit isn't
preserved in the checkpoint / ruby trace, the wrong data gets used
somewhere later down the line.
Post by Joel Hestness
Dirty cache lines could be
shared in a read-only state among caches. Whether to allow data in caches
to differ from memory (i.e. dirty bit set) is a choice made by a protocol's
directory. It could write a line back to memory to clean it, or it could
allow shared (read-only) copies that are identical across caches but dirty
with respect to memory. I'm not sure whether the MOESI_hammer directory
allows this, but it is a possibility.
OK, thanks for the explanation.
Post by Joel Hestness
Are you still testing this with MOESI_hammer?
Yes, I am.
Post by Joel Hestness
Also, I'm not sure what
code you're referring to that checks the access permissions and dirty bits
(during cache warm-up?). Can you point us to that code?
Yeah, the problem is that it doesn't check the dirty bits, just the
access permissions. It's in src/mem/ruby/structures/CacheMemory.cc:326,
function recordCacheContents()

Cheers
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
Joel Hestness
2015-06-23 00:12:56 UTC
Hi Tim,
I'm having a bit of trouble following, so perhaps I can try to reiterate
your progress:
1) After a checkpoint restore, your simulated benchmark crashes with a
memory bug
2) You've tracked down that a cache line may not have been properly
saved into the cache trace
- During checkpointing, the line had read-only permissions, but was
dirty (i.e. probably differed from the version in memory)
3) On restore, something is happening with this cache line that is
causing a memory bug. Can you elaborate on this?

It seems like maybe the restored cache line contains dirty data that
needs to get written back to memory, but maybe the line gets evicted (i.e.
dirty data gets lost), because the dirty status is not saved during
checkpoint cache recording? If this is the case, this is surprising,
because the RubySystem also flushes cache data back to memory during
serialize(). This should ensure that memory has a correct version of the
data, whether or not it was dirty.

I can imagine a few possible problems here:
A) MOESI_hammer has a bug that allows a dirty line to be shared read-only
B) MOESI_hammer doesn't properly implement the flush to push the dirty
data back to memory
C) The memory checkpoint is being taken before the RubySystem cache
flushing, which would mean that memory's contents do not contain dirty data

(C) seems most probable given that the changeset 10524 moved memories out
of Ruby (note in that changeset that memory checkpointing occurred in
RubySystem::serialize() AFTER the cache flush operation). Can you check
whether the RubySystem or the memories execute serialize() first?
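
Possibility (C) can be illustrated with a toy model. The sketch below is
plain Python, not gem5 code: ToyMemory, ToyCache, checkpoint, and run are
made-up names, and the model assumes a single cache whose flush writes
every dirty line back to memory. It compares the memory image saved when
memory is serialized before the flush with the image saved after it.

```python
class ToyMemory:
    """Backing store: address -> value."""

    def __init__(self, contents):
        self.contents = dict(contents)

    def serialize(self):
        # Snapshot of memory at the moment the checkpoint is taken.
        return dict(self.contents)


class ToyCache:
    """Cache holding lines that may be newer than memory."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}  # address -> (value, dirty)

    def write(self, addr, value):
        self.lines[addr] = (value, True)

    def flush(self):
        # Write every dirty line back, then mark it clean.
        for addr, (value, dirty) in list(self.lines.items()):
            if dirty:
                self.memory.contents[addr] = value
                self.lines[addr] = (value, False)


def checkpoint(memory, cache, memory_first):
    """Return the memory image saved under the given serialize order."""
    if memory_first:
        image = memory.serialize()  # memory saved before the flush
        cache.flush()
    else:
        cache.flush()
        image = memory.serialize()  # memory saved after the flush
    return image


def run(memory_first):
    memory = ToyMemory({0x40: "old"})
    cache = ToyCache(memory)
    cache.write(0x40, "new")  # line is now dirty in the cache
    return checkpoint(memory, cache, memory_first)


stale = run(memory_first=True)
fresh = run(memory_first=False)
print(stale[0x40], fresh[0x40])  # prints: old new
```

In this model the checkpoint taken with memory first keeps the stale
value, which is the kind of mismatch that would only show up after a
restore and not when simulation simply continues past the checkpoint.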

Joel
--
Joel Hestness
PhD Candidate, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison
http://pages.cs.wisc.edu/~hestness/
Timothy M Jones
2015-06-23 14:34:34 UTC
Post by Joel Hestness
(C) seems most probable given that the changeset 10524 moved memories out
of Ruby (note in that changeset that memory checkpointing occurred in
RubySystem::serialize() AFTER the cache flush operation). Can you check
whether the RubySystem or the memories execute serialize() first?
Thank you Joel, this is exactly what I was getting towards, but your
summary made it all the faster getting there. This was the issue.

I have now fixed this problem and the original one too - I'll post a
patch on reviewboard shortly for comments, once I've tested it a little
more.

Cheers
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/
Beckmann, Brad
2015-06-22 17:57:39 UTC
You could certainly do that. You are absolutely correct that a Ruby simulation does not need a trace to run from a checkpoint.

Brad



-----Original Message-----
From: gem5-dev [mailto:gem5-dev-***@gem5.org] On Behalf Of Timothy M Jones
Sent: Thursday, June 18, 2015 12:32 AM
To: gem5-***@gem5.org
Subject: Re: [gem5-dev] Ruby serialize removing event queue head

Hi Brad,
Post by Beckmann, Brad
The benefit of creating a trace, rather than just inserting data into the cache, is two-fold. First, by creating a trace from a very large cache system, one can warm up caches of different sizes, associativities, and even completely different cache hierarchies/configurations from a single trace. Second, and probably more important, Ruby protocols rely on timing requests to set cache block state to the unique states used by a particular protocol. Often Ruby is used to compare different protocols, and this process allows us to compare protocols using the exact same checkpoint.
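
The first point in the quoted explanation can be sketched as follows. This is a small Python model under stated assumptions, not Ruby's cache recorder: SetAssocCache and warm_up are hypothetical names, the cache uses simple LRU replacement, and the trace is just a list of addresses. It only illustrates replaying one trace into differently sized caches; it does not model protocol states or timing requests.

```python
from collections import OrderedDict


class SetAssocCache:
    """Tiny set-associative cache with LRU replacement (illustrative)."""

    def __init__(self, num_sets, ways, block_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        block = addr // self.block_size
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)  # mark as most recently used
            return
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict least recently used
        lines[block] = True

    def resident_blocks(self):
        return sorted(b for s in self.sets for b in s)


def warm_up(cache, trace):
    """Replay recorded addresses so the cache fills through its own logic."""
    for addr in trace:
        cache.access(addr)
    return cache


# One recorded trace, replayed into two different configurations.
trace = [0x000, 0x040, 0x080, 0x0C0, 0x000, 0x100]
big = warm_up(SetAssocCache(num_sets=4, ways=2), trace)
small = warm_up(SetAssocCache(num_sets=1, ways=2), trace)
print(big.resident_blocks())
print(small.resident_blocks())
```

Each configuration ends up with whatever subset of the trace its own geometry and replacement policy retain, which is why one trace can serve several cache configurations.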
Thanks for the explanation. OK, so I understand why you want to have a trace, but is there any need for it, or could you just start at a checkpoint with a totally empty cache (as in the classic model)?
Basically, is this trace simply a way to avoid the need to warm up the caches after a checkpoint?

At the moment I can create the trace at a checkpoint, which is progress, but I get problems both in the simulator and simulated system when restoring from the checkpoint. I'd like to know whether to invest the time in getting this to work, or whether I should simply implement
memWriteback() for ruby to flush dirty data before a checkpoint, then do away with the trace altogether.

Cheers
Tim
--
Timothy M. Jones
http://www.cl.cam.ac.uk/~tmj32/