
Conversation

@tclinken (Contributor)

Since storage queue nodes account for a large portion of memory usage,
we can save space by only allocating 96 bytes instead of 128 bytes for
each node.
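
For orientation, here is a minimal sketch of the size-class dispatch this change extends. `FastAllocator<Size>::allocate()` is the real interface in flow/FastAlloc.h, but the cascade below is illustrative, not code from the repo:

```cpp
// Sketch only: approximates how a fixed-size allocation could be routed to
// FastAllocator pools, with the new 96-byte class between 64 and 128.
void* allocateFromPool(int size) {
    if (size <= 64) return FastAllocator<64>::allocate();
    if (size <= 96) return FastAllocator<96>::allocate();   // new pool: 32 fewer bytes per node
    if (size <= 128) return FastAllocator<128>::allocate();
    return new uint8_t[size];  // anything larger goes to the general-purpose allocator
}
```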

@ajbeamon (Contributor) left a comment

There are some spots (3?) where we log metrics about each fast allocator. We should add the new pool to those as well.

@ajbeamon (Contributor) left a comment

I took a peek at some running clusters to see how much memory this might account for. It looks like we top out at about 1GB of memory allocated in 128-byte pools, which would mean we could save up to 250MB per process (32 of every 128 bytes, i.e. 25%) in these clusters if all of that memory were being used for this purpose. It looks like our average use in this range is a good bit less (the cluster with the highest average came in under 600MB).

This will come at the cost of having potentially unused memory resident in an additional pool, which, based on a look at similarly sized pools in these clusters, could be up to a few hundred MB, although on average it's also a good bit less.

Have you run any tests that measure the difference in memory used with and without this pool over some extended period? What did it look like?

```diff
 }

-static const int overheadPerItem = 128*4;
+static const int overheadPerItem = 96*4;
```

There's not really any documentation for how this overhead is calculated, but I assume you have some idea since you're changing it. If so, would you mind adding a comment to explain what it's based on? As an aside, if it's based on the fast-allocated size of PTree objects, it seems like that wouldn't be fixed, given that some of the types involved are templated. It looks like we only use this variable in one place, where the types would be fixed, so maybe that's how we get away with it?
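
A comment along these lines might be what's being asked for; note that the four-node breakdown is inferred from the `96*4` expression, not from any documentation:

```cpp
// Assumed rationale only (the 4x factor is inferred, not documented):
// each versioned item appears to be charged for ~4 PTree node
// allocations, each of which is served from the 96-byte fast allocator
// pool after this change (the 128-byte pool before it).
static const int overheadPerItem = 96 * 4;
```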

@etschannen (Contributor)

You are missing calls to TRACEALLOCATOR and DETAILALLOCATORMEMUSAGE.
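
Presumably the fix mirrors the existing per-size call sites; assuming both macros take the pool size as an argument, as their other uses do, the additions would look roughly like this (pattern only, not verbatim from the repo):

```cpp
// Per-pool memory sample events: add the 96-byte pool alongside the
// existing size classes.
TRACEALLOCATOR(64);
TRACEALLOCATOR(96);    // newly added pool
TRACEALLOCATOR(128);

// And in the detailed memory-usage event, assuming the chained .detail()
// style used for the other sizes:
//   .DETAILALLOCATORMEMUSAGE(64)
//   .DETAILALLOCATORMEMUSAGE(96)   // newly added pool
//   .DETAILALLOCATORMEMUSAGE(128)
```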

@tclinken (Contributor, Author)

@ajbeamon: We don't have performance test results yet, which is why I left this pull request in draft mode. @kaomakino is planning to run some tests on this branch to see whether this is a worthwhile optimization.

@tclinken force-pushed the fast-allocate-ptree-nodes branch from 759ad5a to 4293a3e on March 25, 2019
@tclinken marked this pull request as ready for review on April 10, 2019
@tclinken (Contributor, Author) commented May 15, 2019

This change has now been benchmarked with an insert-only workload; we saw an increase in throughput due to a smaller storage queue and thus a less aggressive Ratekeeper.

@ajbeamon (Contributor)

If the benefit of this is just that the storage queue is smaller, does the performance increase imply that the bottleneck in this case was that the input rate over 5 seconds is large enough to fill the queue? Or I guess the window could potentially be longer if you have a larger durability lag.

If that is the bottleneck, are you seeing this as a sustained improvement, or only during short bursts? I think the expectation is that the disk would be the limiter rather than the size of the queue, but if you are running into queue size issues, an increase in the queue size may also be warranted.

I don't intend any of this to mean that we shouldn't make this change, by the way; I'm just interested in what we might expect the effects to be in various scenarios.

@tclinken (Contributor, Author)

The benefit is mainly that the storage queue is smaller, so bursts of writes don't trigger throttling as quickly. In the long run, with a steady workload, writing to disk is the bottleneck, so we're not seeing a throughput increase in that case.
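
To make the effect concrete, a sketch of the queue accounting; the 100-byte item size is hypothetical, while the 4x node overhead comes from the overheadPerItem change above:

```cpp
// Worked example of per-item queue accounting under the old and new
// overhead constants (item payload size is a made-up illustration).
constexpr int oldOverheadPerItem = 128 * 4;  // 512 bytes charged per item
constexpr int newOverheadPerItem = 96 * 4;   // 384 bytes charged per item

constexpr int queueBytesCharged(int keyValueBytes, int overheadPerItem) {
    return keyValueBytes + overheadPerItem;
}

static_assert(queueBytesCharged(100, oldOverheadPerItem) == 612, "before");
static_assert(queueBytesCharged(100, newOverheadPerItem) == 484, "after");
// ~21% fewer bytes charged per item, so the same queue byte budget absorbs
// a ~26% larger burst before Ratekeeper starts throttling.
```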

@ajbeamon merged commit a8b9d8e into apple:master on May 17, 2019