I have not been completely following this thread, but I can answer your question about unserializing cache contents.
The benefit of creating a trace, rather than just inserting data into the cache, is twofold. First, by creating a trace from a very large cache system, one can warm up caches of different sizes and associativities, and even completely different cache hierarchies or configurations, from a single trace. Second, and probably more important, Ruby protocols rely on timing requests to set cache block state to the unique states used by a particular protocol. Ruby is often used to compare different protocols, and this process allows us to compare protocols using the exact same checkpoint.
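To illustrate the first point, here is a minimal sketch of replaying one recorded trace into caches with different geometries. Everything in it (ToyCache, warmUp, exampleTrace, the LRU policy) is a hypothetical toy model, not Ruby's CacheMemory or CacheRecorder, and it does not model protocol states or timing at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

// Toy set-associative cache with LRU replacement (hypothetical, not Ruby code).
class ToyCache {
  public:
    ToyCache(std::size_t numSets, std::size_t assoc, std::size_t blockBytes)
        : sets_(numSets), assoc_(assoc), blockBytes_(blockBytes) {}

    // Touch the block containing addr; evict the LRU block if the set overflows.
    void access(std::uint64_t addr) {
        const std::uint64_t block = addr / blockBytes_;
        std::list<std::uint64_t> &set = sets_[block % sets_.size()];
        set.remove(block);      // drop any older position of this block
        set.push_front(block);  // most recently used block sits at the front
        if (set.size() > assoc_)
            set.pop_back();     // evict the least recently used block
    }

    bool contains(std::uint64_t addr) const {
        const std::uint64_t block = addr / blockBytes_;
        const std::list<std::uint64_t> &set = sets_[block % sets_.size()];
        for (std::uint64_t b : set)
            if (b == block)
                return true;
        return false;
    }

    std::size_t residentBlocks() const {
        std::size_t n = 0;
        for (const auto &set : sets_)
            n += set.size();
        return n;
    }

  private:
    std::vector<std::list<std::uint64_t>> sets_;
    std::size_t assoc_;
    std::size_t blockBytes_;
};

// Replay one recorded trace of addresses into any cache configuration.
inline void warmUp(ToyCache &cache, const std::vector<std::uint64_t> &trace) {
    for (std::uint64_t addr : trace)
        cache.access(addr);
}

// Example trace (an assumption for illustration): eight consecutive 64-byte blocks.
inline std::vector<std::uint64_t> exampleTrace() {
    std::vector<std::uint64_t> trace;
    for (std::uint64_t i = 0; i < 8; ++i)
        trace.push_back(i * 64);
    return trace;
}
```

The same trace fills a 4-set, 2-way cache completely, while a 2-set direct-mapped cache keeps only the last block that mapped to each set. Inserting raw data into fixed cache locations would tie the checkpoint to one geometry.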
Thanks Nilay and Joel for the information.
I've been playing around with this over the past few days and I can't work out what the point of the flush is. The CacheRecorder already has a copy of all the data blocks in the trace before the flush starts.
Removing the flush event and the subsequent simulation produces exactly the same system.ruby.cache.gz file as leaving them in, so I guess it's safe to remove them.
So, with that out of the way, I can create checkpoints and exit the simulator correctly. I'm not 100% sure about restoring the checkpoint though, and it seems a little hacky. Is there a reason it has to unserialise by inserting memory requests into the event queue - couldn't it just write the data into the correct locations in the caches?
There's also a question about whether ruby should be recording its state at all. Shouldn't it do the same as the classic memory system caches and implement memWriteback() to flush all dirty data out before checkpointing happens? Then it wouldn't need to trace anything.
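As a rough sketch of the writeback idea in that last question: the toy below copies every dirty line to a memory image before a checkpoint would be taken. ToyLine, ToyWritebackCache and writebackAll() are hypothetical names for illustration only; this is not the classic memory system's memWriteback() implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// One cached block: its data (a single byte here, for brevity) and a dirty bit.
struct ToyLine {
    std::uint8_t data;
    bool dirty;
};

// Hypothetical write-back cache keyed by block address.
struct ToyWritebackCache {
    std::map<std::uint64_t, ToyLine> lines;

    // A write allocates the line in the cache and marks it dirty.
    void write(std::uint64_t addr, std::uint8_t value) {
        lines[addr] = ToyLine{value, true};
    }

    // Copy every dirty line into memory and mark it clean.
    // Returns the number of lines written back.
    int writebackAll(std::map<std::uint64_t, std::uint8_t> &memory) {
        int written = 0;
        for (auto &entry : lines) {
            if (entry.second.dirty) {
                memory[entry.first] = entry.second.data;
                entry.second.dirty = false;
                ++written;
            }
        }
        return written;
    }
};
```

After writebackAll() the memory image alone holds all the data. The toy records nothing about which blocks were cached, so a restore from that image would start with cold caches.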
Post by Joel Hestness
I'm pretty sure Tim is correct that the checkpointing bugs were
introduced earlier than the changeset Nilay points to; gem5-gpu is
currently using gem5 rev 10645
<http://repo.gem5.org/gem5/rev/cd95d4d51659>, and we cannot get
reliable checkpoint and restore with it. Note that Tim's bug may not
be the only checkpointing bug that exists right now.
To answer Tim's question: While taking a checkpoint, Ruby
commandeers the event queue to inject flushing memory accesses into
the caches. This is used to generate a trace of cache contents, which
can be used to warm up the caches on checkpoint restore. To take over
control of the event queue, Ruby clears the event at the queue head (I
think this assumes there is only 1 event in the queue? This should
probably be checked), and then adds its own event for the cache
flushing operation. After the caches have been flushed (simulate()
call in RubySystem::serialize()), Ruby restores the head event that
was in the queue and rolls back the current tick.
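The sequence described above can be sketched with a toy event queue. ToyEvent, ToyEventQueue and toySerialize() are hypothetical stand-ins, not gem5's EventQueue or RubySystem::serialize(). As a simplification, the toy sets aside all pending events rather than only the head.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

using Tick = std::uint64_t;

struct ToyEvent {
    std::string name;
    Tick when;
};

// Hypothetical event queue: pending events sorted by time, plus a log of
// the events that have been executed.
struct ToyEventQueue {
    Tick curTick = 0;
    std::deque<ToyEvent> pending;
    std::vector<std::string> executed;

    // Insert the event, keeping the queue sorted by its scheduled tick.
    void schedule(const ToyEvent &e) {
        auto pos = pending.begin();
        while (pos != pending.end() && pos->when <= e.when)
            ++pos;
        pending.insert(pos, e);
    }

    // Run until the queue is empty, advancing the current tick.
    void simulate() {
        while (!pending.empty()) {
            ToyEvent e = pending.front();
            pending.pop_front();
            curTick = e.when;
            executed.push_back(e.name);
        }
    }
};

// Take over the queue, run a flush event, then restore the queue and the tick.
inline void toySerialize(ToyEventQueue &q, Tick flushLatency) {
    const Tick savedTick = q.curTick;
    std::deque<ToyEvent> saved;
    saved.swap(q.pending);  // set the existing events aside
    q.schedule(ToyEvent{"cache_flush", q.curTick + flushLatency});
    q.simulate();           // only the flush event runs here
    q.pending.swap(saved);  // put the original events back
    q.curTick = savedTick;  // roll back simulated time
}
```

Anything that still expects one of the set-aside events to be on the queue during the nested simulate() call will not find it there, which is the kind of interaction worth checking.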
One way to check if this cooldown operation is at fault for
unreliable checkpointing is to simply comment out the event queue
commandeering, and try to take a checkpoint. You may also be able to
test checkpoint restore by commenting the cache warm-up code in
RubySystem::unserialize(). If checkpoint and restore work without the
event queue commandeering, it is likely that the event queue
manipulation is problematic.
I'd also recommend taking a checkpoint and restoring from it with
the gem5 option --debug-flags=RubyCacheTrace, which will show what
the cache flushing and warm-up are doing, respectively.
Your bisection is not right. You might want to take a look at the
date: Mon Mar 23 06:57:36 2015 -0400
summary: sim: Reuse the same limit_event in simulate()
I suggest that you revert this changeset in your repo while I think
about what needs to be done.
Further to this message, I've used hg bisect to find the
revision that breaks checkpointing with ruby. It's revision
10524 that Nilay committed in November that's the first bad
changeset. It fails with the panic() on the missing event that
I wrote about previously.
I've scanned through the diff and can't immediately see any
reason why this would break serialisation, although it does
remove some of the code to serialise ruby state.
Could anyone (Nilay?) give me a hint as to why this might break
checkpointing with ruby?
I've compiled with the MOESI_hammer protocol for x86, then run
./build/X86/gem5.opt --remote-gdb-port=0 -d <outdir>
configs/example/fs.py -n 1 --kernel <my-kernel> --script
configs/boot/hack_back_ckpt.rcS --max-checkpoints 1
--checkpoint-dir <cptdir> --disk-image <my-disk-image>
--cpu-type timing --restore-with timing --ruby
Any help would be appreciated. I don't know ruby at all, so
trying to work out what's going on is slow....
Could someone tell me why we need to take the head event off the
event queue in RubySystem::serialize() in
Event* eventq_head = eventq->replaceHead(NULL);
The problem I'm getting is that when simulate() is called a few lines
later, it tries to reschedule the simulate_limit_event, but that
causes a panic because it's no longer on the event queue. This happens
when trying to take a checkpoint with ruby. I can't work out from the
comments why the head event needs to be taken off in the
Timothy M. Jones
gem5-dev mailing list
PhD Candidate, Computer Architecture
Dept. of Computer Science, University of Wisconsin - Madison