Create 96-byte fast allocator for storage queue PTree nodes #1336
Conversation
ajbeamon left a comment
There are some spots (3?) where we log metrics about each fast allocator. We should add this allocator to those as well.
ajbeamon left a comment
I took a peek at some running clusters to see how much memory this might account for. It looks like we top out at about 1GB memory allocated in 128 byte pools, which would mean that we could save up to 250MB per process in these clusters if all of that memory was being used for this purpose. It looks like our average use in this range is a good bit less (the cluster with the highest average came in under 600MB).
This will come at the cost of potentially unused memory staying resident in an additional pool; based on similarly sized pools in these clusters, that could be up to a few hundred MB, although on average it's also a good bit less.
Have you run any tests that measure the difference in memory used with and without this pool over some extended period? What did it look like?
fdbclient/VersionedMap.h
Outdated
	}

-	static const int overheadPerItem = 128*4;
+	static const int overheadPerItem = 96*4;
There's not really any documentation for how this overhead is getting calculated, but I assume you have some idea since you're changing it. If so, would you mind adding a comment to explain what this is based on? As an aside, if it's based on the fast allocated size of PTree objects, it seems like that wouldn't be fixed given that some of the types involved are templated. It looks like we only use this variable in one place where the types would be fixed, so maybe that's how we'd get away with it?
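For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a fixed-size free-list pool in the spirit of FoundationDB's fast allocator. The names and structure are illustrative, not the actual fdb implementation; the point is that each pool hands out slots of exactly one size, so an object that fits in 96 bytes wastes 32 bytes per allocation if it is forced into the next pool up at 128.

```cpp
#include <cstddef>
#include <vector>

// Illustrative fixed-size pool (not FoundationDB's actual FastAllocator).
// All slots in a pool are SlotSize bytes; freed slots go on a LIFO free list.
template <size_t SlotSize>
class FixedPool {
    std::vector<void*> freeSlots; // LIFO free list of returned slots
    std::vector<char*> blocks;    // backing storage, carved into slots
    static const size_t slotsPerBlock = 64;

public:
    void* allocate() {
        if (freeSlots.empty()) {
            // Grab a fresh block and carve it into fixed-size slots.
            char* block = new char[SlotSize * slotsPerBlock];
            blocks.push_back(block);
            for (size_t i = 0; i < slotsPerBlock; ++i)
                freeSlots.push_back(block + i * SlotSize);
        }
        void* p = freeSlots.back();
        freeSlots.pop_back();
        return p;
    }

    void release(void* p) { freeSlots.push_back(p); }

    ~FixedPool() {
        for (char* b : blocks)
            delete[] b;
    }
};
```

A FastAllocated-style mixin would then route a class's `operator new`/`operator delete` to the pool matching its rounded-up size, which is how adding a 96-byte pool lets PTree nodes stop consuming 128-byte slots.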
You are missing calls to TRACEALLOCATOR and DETAILALLOCATORMEMUSAGE
@ajbeamon: We don't have performance test results yet, which is why I left the review in draft mode. @kaomakino is planning to run some tests on this branch to see if this is a worthwhile optimization.
Since storage queue nodes account for a large portion of memory usage, we can save space by only allocating 96 bytes instead of 128 bytes for each node.
Force-pushed from 759ad5a to 4293a3e
This change has now been benchmarked with an insert-only workload and showed an increase in throughput, due to a smaller storage queue and thus a less aggressive ratekeeper.
If the benefit of this is just that the storage queue is smaller, does the performance increase imply that the bottleneck in this case was that the input rate in 5 seconds is large enough to fill the queue? Or I guess it could potentially be longer, if you have a larger durability lag. If that is the bottleneck, are you seeing this as a sustained improvement, or only during short bursts? I think the expectation is that the disk would be limiting rather than the size of the queue, but if you are running into queue size issues, an increase in the queue size may also be warranted. I don't intend any of this to mean that we shouldn't make this change, by the way; I'm just interested in what we might expect the effects to be in various scenarios.
Mainly the benefit is just that the storage queue is smaller, so bursts of writes will not result in throttling as quickly. In the long run, with a steady workload, writing to disk is the bottleneck, so we're not seeing throughput increase in this case. |