Discussion:
[gem5-dev] custom allocator for Request/Packet objects
Gabe Black
2018-11-29 03:21:56 UTC
Permalink
I was just wondering whether it would make sense to have custom allocators
defined for Request and Packet types which would keep around a pool of them
rather than defaulting to the normal allocator. I suspect since both types
of objects are allocated very frequently this could save a lot of heap
overhead.

I think there are two complications to doing this. First, we'd want to make
sure Request and Packet objects are truly those objects and are of the
appropriate size. Subclasses could add new members and make the types
bigger. I think to ensure that, we'd need to add the "final" keyword (added
in C++11) to ensure those types can't be subclassed and have unpredictable
sizes.

Second, we use make_shared a lot to build Requests, and that actually
allocates a larger amount of memory to hold the object and reference count
information. It allocates that larger object with new, and it looks like
you're supposed to use allocate_shared if you want to use a custom
allocator. Writing a custom allocator looks relatively complicated, but it
might make a big difference if it works correctly.

Also we'd probably want to put the Request allocator incantation in a small
function rather than calling make_shared or allocate_shared directly to
avoid a lot of boilerplate and churn if things need to change.

Thoughts? Like I said I suspect this may make a significant difference, but
it might not be easy to implement and may not actually make much of a
difference at all.

Gabe
Jason Lowe-Power
2018-11-29 16:39:44 UTC
Permalink
Hey Gabe,

Are you thinking that a custom allocator would make a difference in terms
of memory footprint or in terms of performance (or both)?

A couple of thoughts:
- I'm hesitant to put the final keyword on Packet. I think we could see
some code cleanup by making Packet a polymorphic object (e.g., Daniel has
been talking about implementing packets with compressed data). Obviously,
using polymorphism isn't required, and I don't have any plans to make this
change in the near future. But it's something that I've been thinking about.
- I like the idea of requiring the use of a small function to allocate
requests and packets. In many cases, for Packets, we use
Packet::createRead()/createWrite(), which is a step in that direction. I
think this would be a relatively non-controversial change moving in the
direction of custom allocators. It may also allow us to do some testing on
this more easily.

Another argument against custom allocators is that we'll be adding a
significant amount of error-prone code. I like the idea of using standard
libraries as much as possible to aid in understanding the code and reducing
the bug surface.

That said, if we get a 20% speedup or a reduction in memory footprint then
this change would likely be worth it!

Cheers,
Jason
Post by Gabe Black
I was just wondering whether it would make sense to have custom allocators
defined for Request and Packet types which would keep around a pool of them
rather than defaulting to the normal allocator. I suspect since both types
of objects are allocated very frequently this could save a lot of heap
overhead.
I think there are two complications to doing this. First, we'd want to make
sure Request and Packet objects are truly those objects and are of the
appropriate size. Subclasses could add new members and make the types
bigger. I think to ensure that, we'd need to add the "final" keyword (added
in C++11) to ensure those types can't be subclassed and have unpredictable
sizes.
Second, we use make_shared a lot to build Requests, and that actually
allocates a larger amount of memory to hold the object and reference count
information. It allocates that larger object with new, and it looks like
you're supposed to use allocate_shared if you want to use a custom
allocator. Writing a custom allocator looks relatively complicated, but it
might make a big difference if it works correctly.
Also we'd probably want to put the Request allocator incantation in a small
function rather than calling make_shared or allocate_shared directly to
avoid a lot of boilerplate and churn if things need to change.
Thoughts? Like I said I suspect this may make a significant difference, but
it might not be easy to implement and may not actually make much of a
difference at all.
Gabe
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Nikos Nikoleris
2018-11-29 17:21:13 UTC
Permalink
This post might be inappropriate. Click to display it.
Daniel Carvalho
2018-11-29 17:43:19 UTC
Permalink
This post might be inappropriate. Click to display it.
Gabe Black
2018-11-29 23:07:45 UTC
Permalink
I looked at the code as is, and Packets themselves aren't ever inherited by
anything, and they maintain a stack (I think) of sender state objects which
doesn't affect the size of the packet. The sender state is inherited in a
bunch of places since that is often customized. As far as compressed
packets, what would the benefit be? It seems like there would be a lot of
overhead to compress and decompress them, and I don't think they ever get
all that big. I'm also hesitant to add "final" in there since I think it
would be potentially limiting, but sometimes sensible limits are useful :-).

The reason exactly how big Request/Packets are is a concern is that I think
you'd get a meaningful performance benefit if you could use extra knowledge
to do things faster than just using new. For instance, if you knew that
what you were allocating was always X bytes, then you could just have a
bunch of blocks of X bytes laying around in a list and return one when
asked. You wouldn't have to try to keep things compacted like new would, or
keep track of (or even check) how big things were. That marginal bit of
savings could make a big difference if allocations happened a lot,
particularly if the type of allocations were ones new (and/or malloc if
it's built on it) was bad at. My brief Googling says malloc is bad at lots
of small allocations, for what that's worth.

Also, a custom allocator could make some types of debugging marginally
harder. For instance, if you used valgrind, it would replace malloc/new,
and that wouldn't be used in the typical way for those types. They would
likely be mostly batch allocated at the start of simulation, and then just
returned by the various mechanisms later. That would make information about
where a particular Request object came from less useful, although you would
still likely be able to see that it was accessed in an illegal way, and
where that access happened.

I agree that taking an incremental approach and doing some experiments are
worth while, and that wrapping request allocation in a small function would
be a good first step. I can't promise a timeline, but hopefully I'll have
something to share at some point in the future.

Gabe
Post by Nikos Nikoleris
Hey all,
The bigger opportunity is probably around packets as sometimes we are
allocating a number of them per request.
A custom allocator would be a great contribution if it would make gem5
substantially faster. But if it proved make things only marginally
better, I agree with Jason, we would need to factor in the fact that we
will be making the code more fragile. Would debugging/profiling be
harder if we used a custom deallocator?
If the size of the Packet is a concern, we will probably have to deal
with members that are not fixed in size such as the senderState. It
might also be worth looking into something similar for the pointers from
packets such as data.
Packet and Request memory management is something that we have been
concerned about as well. Giacomo, recently, changed all naked Request
pointers for shared_ptr and now we don't have to manually deallocate
them. But when we tried to do this for Packet pointers (figuring out
when to de-allocate a packet is not trivial) there was a substantial
overhead.
Nikos
Post by Jason Lowe-Power
Hey Gabe,
Are you thinking that a custom allocator would make a difference in terms
of memory footprint or in terms of performance (or both)?
- I'm hesitant to put the final keyword on Packet. I think we could see
some code cleanup by making Packet a polymorphic object (e.g., Daniel has
been talking about implementing packets with compressed data). Obviously,
using polymorphism isn't required, and I don't have any plans to make
this
Post by Jason Lowe-Power
change in the near future. But it's something that I've been thinking
about.
Post by Jason Lowe-Power
- I like the idea of requiring the use of a small function to allocate
requests and packets. In many cases, for Packets, we use
Packet::createRead()/createWrite(), which is a step in that direction. I
think this would be a relatively non-controversial change moving in the
direction of custom allocators. It may also allow us to do some testing
on
Post by Jason Lowe-Power
this more easily.
Another argument against custom allocators is that we'll be adding a
significant amount of error-prone code. I like the idea of using standard
libraries as much as possible to aid in understanding the code and
reducing
Post by Jason Lowe-Power
the bug surface.
That said, if we get a 20% speedup or a reduction in memory footprint
then
Post by Jason Lowe-Power
this change would likely be worth it!
Cheers,
Jason
Post by Gabe Black
I was just wondering whether it would make sense to have custom
allocators
Post by Jason Lowe-Power
Post by Gabe Black
defined for Request and Packet types which would keep around a pool of
them
Post by Jason Lowe-Power
Post by Gabe Black
rather than defaulting to the normal allocator. I suspect since both
types
Post by Jason Lowe-Power
Post by Gabe Black
of objects are allocated very frequently this could save a lot of heap
overhead.
I think there are two complications to doing this. First, we'd want to
make
Post by Jason Lowe-Power
Post by Gabe Black
sure Request and Packet objects are truly those objects and are of the
appropriate size. Subclasses could add new members and make the types
bigger. I think to ensure that, we'd need to add the "final" keyword
(added
Post by Jason Lowe-Power
Post by Gabe Black
in C++11) to ensure those types can't be subclassed and have
unpredictable
Post by Jason Lowe-Power
Post by Gabe Black
sizes.
Second, we use make_shared a lot to build Requests, and that actually
allocates a larger amount of memory to hold the object and reference
count
Post by Jason Lowe-Power
Post by Gabe Black
information. It allocates that larger object with new, and it looks like
you're supposed to use allocate_shared if you want to use a custom
allocator. Writing a custom allocator looks relatively complicated, but
it
Post by Jason Lowe-Power
Post by Gabe Black
might make a big difference if it works correctly.
Also we'd probably want to put the Request allocator incantation in a
small
Post by Jason Lowe-Power
Post by Gabe Black
function rather than calling make_shared or allocate_shared directly to
avoid a lot of boilerplate and churn if things need to change.
Thoughts? Like I said I suspect this may make a significant difference,
but
Post by Jason Lowe-Power
Post by Gabe Black
it might not be easy to implement and may not actually make much of a
difference at all.
Gabe
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended
recipient, please notify the sender immediately and do not disclose the
contents to any other person, use it for any purpose, or store or copy the
information in any medium. Thank you.
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Jason Lowe-Power
2018-11-30 01:17:42 UTC
Permalink
Couple more comments below.
Post by Gabe Black
I looked at the code as is, and Packets themselves aren't ever inherited by
anything, and they maintain a stack (I think) of sender state objects which
doesn't affect the size of the packet. The sender state is inherited in a
bunch of places since that is often customized. As far as compressed
packets, what would the benefit be? It seems like there would be a lot of
overhead to compress and decompress them, and I don't think they ever get
all that big. I'm also hesitant to add "final" in there since I think it
would be potentially limiting, but sometimes sensible limits are useful :-).
Not to speak for Daniel too much, but I think he was planning on modeling
the *data* in the packet being compressed. Then, there may be a different
packet size depending on the compression ratio of the data, for instance. I
think this would be cleaner to implement by inheriting from packet than
adding the interface to the current Packet. But, this isn't a hard
requirement. It was just an example of why we might want to inherit from
Packet at some point.
Post by Gabe Black
The reason exactly how big Request/Packets are is a concern is that I think
you'd get a meaningful performance benefit if you could use extra knowledge
to do things faster than just using new. For instance, if you knew that
what you were allocating was always X bytes, then you could just have a
bunch of blocks of X bytes laying around in a list and return one when
asked. You wouldn't have to try to keep things compacted like new would, or
keep track of (or even check) how big things were. That marginal bit of
savings could make a big difference if allocations happened a lot,
particularly if the type of allocations were ones new (and/or malloc if
it's built on it) was bad at. My brief Googling says malloc is bad at lots
of small allocations, for what that's worth.
We use tcmalloc as our default allocator right now. I wonder how that
compares. IIRC, when we moved from malloc to tcmalloc we got a ~10%
performance boost. At least, that's what Hansson said ;).
Post by Gabe Black
Also, a custom allocator could make some types of debugging marginally
harder. For instance, if you used valgrind, it would replace malloc/new,
and that wouldn't be used in the typical way for those types. They would
likely be mostly batch allocated at the start of simulation, and then just
returned by the various mechanisms later. That would make information about
where a particular Request object came from less useful, although you would
still likely be able to see that it was accessed in an illegal way, and
where that access happened.
This is a big downside. I've found memory errors with valgrind in packets
*many times* due to how we copy data in to/out of them. This might be a
show stopper for me.

As a side note, tcmalloc has bitten me a number of times here. I've run
valgind with tcmalloc and seen "0 bytes allocated and 0 bytes deleted"
after waiting for hours too many times. I have finally learned to recompile
everything using "--without-tcmalloc" before running valgind.
Post by Gabe Black
I agree that taking an incremental approach and doing some experiments are
worth while, and that wrapping request allocation in a small function would
be a good first step. I can't promise a timeline, but hopefully I'll have
something to share at some point in the future.
Gabe
Post by Nikos Nikoleris
Hey all,
The bigger opportunity is probably around packets as sometimes we are
allocating a number of them per request.
A custom allocator would be a great contribution if it would make gem5
substantially faster. But if it proved make things only marginally
better, I agree with Jason, we would need to factor in the fact that we
will be making the code more fragile. Would debugging/profiling be
harder if we used a custom deallocator?
If the size of the Packet is a concern, we will probably have to deal
with members that are not fixed in size such as the senderState. It
might also be worth looking into something similar for the pointers from
packets such as data.
Packet and Request memory management is something that we have been
concerned about as well. Giacomo, recently, changed all naked Request
pointers for shared_ptr and now we don't have to manually deallocate
them. But when we tried to do this for Packet pointers (figuring out
when to de-allocate a packet is not trivial) there was a substantial
overhead.
Nikos
Post by Jason Lowe-Power
Hey Gabe,
Are you thinking that a custom allocator would make a difference in
terms
Post by Nikos Nikoleris
Post by Jason Lowe-Power
of memory footprint or in terms of performance (or both)?
- I'm hesitant to put the final keyword on Packet. I think we could see
some code cleanup by making Packet a polymorphic object (e.g., Daniel
has
Post by Nikos Nikoleris
Post by Jason Lowe-Power
been talking about implementing packets with compressed data).
Obviously,
Post by Nikos Nikoleris
Post by Jason Lowe-Power
using polymorphism isn't required, and I don't have any plans to make
this
Post by Jason Lowe-Power
change in the near future. But it's something that I've been thinking
about.
Post by Jason Lowe-Power
- I like the idea of requiring the use of a small function to allocate
requests and packets. In many cases, for Packets, we use
Packet::createRead()/createWrite(), which is a step in that direction.
I
Post by Nikos Nikoleris
Post by Jason Lowe-Power
think this would be a relatively non-controversial change moving in the
direction of custom allocators. It may also allow us to do some testing
on
Post by Jason Lowe-Power
this more easily.
Another argument against custom allocators is that we'll be adding a
significant amount of error-prone code. I like the idea of using
standard
Post by Nikos Nikoleris
Post by Jason Lowe-Power
libraries as much as possible to aid in understanding the code and
reducing
Post by Jason Lowe-Power
the bug surface.
That said, if we get a 20% speedup or a reduction in memory footprint
then
Post by Jason Lowe-Power
this change would likely be worth it!
Cheers,
Jason
Post by Gabe Black
I was just wondering whether it would make sense to have custom
allocators
Post by Jason Lowe-Power
Post by Gabe Black
defined for Request and Packet types which would keep around a pool of
them
Post by Jason Lowe-Power
Post by Gabe Black
rather than defaulting to the normal allocator. I suspect since both
types
Post by Jason Lowe-Power
Post by Gabe Black
of objects are allocated very frequently this could save a lot of heap
overhead.
I think there are two complications to doing this. First, we'd want to
make
Post by Jason Lowe-Power
Post by Gabe Black
sure Request and Packet objects are truly those objects and are of the
appropriate size. Subclasses could add new members and make the types
bigger. I think to ensure that, we'd need to add the "final" keyword
(added
Post by Jason Lowe-Power
Post by Gabe Black
in C++11) to ensure those types can't be subclassed and have
unpredictable
Post by Jason Lowe-Power
Post by Gabe Black
sizes.
Second, we use make_shared a lot to build Requests, and that actually
allocates a larger amount of memory to hold the object and reference
count
Post by Jason Lowe-Power
Post by Gabe Black
information. It allocates that larger object with new, and it looks
like
Post by Nikos Nikoleris
Post by Jason Lowe-Power
Post by Gabe Black
you're supposed to use allocate_shared if you want to use a custom
allocator. Writing a custom allocator looks relatively complicated,
but
Post by Nikos Nikoleris
it
Post by Jason Lowe-Power
Post by Gabe Black
might make a big difference if it works correctly.
Also we'd probably want to put the Request allocator incantation in a
small
Post by Jason Lowe-Power
Post by Gabe Black
function rather than calling make_shared or allocate_shared directly
to
Post by Nikos Nikoleris
Post by Jason Lowe-Power
Post by Gabe Black
avoid a lot of boilerplate and churn if things need to change.
Thoughts? Like I said I suspect this may make a significant
difference,
Post by Nikos Nikoleris
but
Post by Jason Lowe-Power
Post by Gabe Black
it might not be easy to implement and may not actually make much of a
difference at all.
Gabe
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
IMPORTANT NOTICE: The contents of this email and any attachments are
confidential and may also be privileged. If you are not the intended
recipient, please notify the sender immediately and do not disclose the
contents to any other person, use it for any purpose, or store or copy
the
Post by Nikos Nikoleris
information in any medium. Thank you.
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
_______________________________________________
gem5-dev mailing list
http://m5sim.org/mailman/listinfo/gem5-dev
Gabe Black
2018-11-30 01:41:14 UTC
Permalink
Post by Jason Lowe-Power
As a side note, tcmalloc has bitten me a number of times here. I've run
valgind with tcmalloc and seen "0 bytes allocated and 0 bytes deleted"
after waiting for hours too many times. I have finally learned to recompile
everything using "--without-tcmalloc" before running valgind.
Yeah, I was thinking before the way to set everything up for valgrind is a
little too non-obvious. It would be nice to have some small scripts or
something to help set that up.

Gabe

Loading...