
Conversation

@nik-io (Contributor) commented Apr 30, 2019

Thought it wouldn't hurt to switch to an LRU policy for the page cache of the AsyncFileCached actor.

  • Replaced the vector-based data structure with a Boost intrusive doubly-linked list that implements the LRU list (a minimal sketch of the structure follows this list).
  • Added an updateHit API to the EvictablePageCache class to update the list on a cache hit.
  • All operations on the LRU list have constant complexity (push_back, erase, s_iterator_to; see https://www.boost.org/doc/libs/1_67_0/doc/html/boost/intrusive/list.html), so this patch should not introduce additional overhead w.r.t. the previous eviction policy.
  • Preliminary tests on a skewed read workload show a modest improvement in hit rate of 10% or less (measured with hit/miss counters that are not part of this patch), and the new policy was never worse than the previous one.
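For reference, here is a minimal sketch of the structure just described; updateHit is the API name introduced by this patch, while the surrounding class and helper names (LruSketch, add, pickVictim) are illustrative only and not the actual FoundationDB code:

    #include <boost/intrusive/list.hpp>

    // Each page carries its list hook directly, so linking/unlinking allocates nothing.
    struct Page : boost::intrusive::list_base_hook<> {
        // page payload ...
    };

    using LruList = boost::intrusive::list<Page>;

    struct LruSketch {
        LruList lru; // front = least recently used, back = most recently used

        void add(Page& p) { lru.push_back(p); }       // O(1): new pages start at the MRU end

        void updateHit(Page& p) {                     // O(1): called on a cache hit
            lru.erase(LruList::s_iterator_to(p));     // unlink in place
            lru.push_back(p);                         // relink at the MRU end
        }

        Page* pickVictim() {                          // O(1): evict the coldest page
            if (lru.empty()) return nullptr;
            Page* victim = &lru.front();
            lru.pop_front();
            return victim;
        }
    };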

@mpilman (Contributor) commented Apr 30, 2019

I would strongly suggest that we benchmark this extensively.

In my experience, random eviction often works much better than one might expect, and LRU adds non-trivial overhead (and CPU is a big problem for us).

At Snowflake we run with very large caches in production. The random eviction policy works great there.

Can you safe-guard this with a knob for now? We have to be able to disable and enable this dynamically until we've confirmed that this doesn't introduce any performance regressions for us.

@nik-io (Contributor, Author) commented May 1, 2019

I will safe-guard this with a knob and update the pull request, no problem.

A couple of notes:

  • As far as I understand it, if you are using the SQLite-based storage engine ("ssd-2"), then the SQLite page cache (starting at sqlite3.amalgamation.c:2026) is also used on top of the AsyncFileCache page cache. The SQLite page cache uses an eviction scheme that has at least an LRU component to it (sqlite3.amalgamation.c:4068).
  • Re CPU overhead: this patch replaces random accesses into an std::vector (followed by dereferencing the vector elements, which are page pointers) with the same number of random memory accesses through linked-list pointers, or fewer, since it evaluates the oldest pages first and those are the least likely to be in use at the moment. All of the Boost intrusive list operations used have constant complexity, so I don't see how this would be more CPU intensive than the existing eviction algorithm (a rough sketch of that algorithm follows this list). If anything, I expect it to be slightly less CPU intensive, since it avoids the (implicit) std::vector resize operations, the calls to randomInt, and the swap-and-erase from the vector in the EvictablePage destructor.
  • I'd be happy to run whatever benchmarks are deemed necessary for acceptance. We typically run go-ycsb skewed and uniform, as well as some modified native fdb tests with a mixed read/write ratio (https://forums.foundationdb.org/t/cluster-tuning-cookbook/520/26). So far we haven't seen a large bottom-line performance difference with this patch, but looking at the page hits and misses (measured with hit/miss counters that are not part of this patch), LRU was consistently better than the random replacement policy.
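For comparison, a rough sketch of the vector-based random eviction pattern referred to above (this is not the actual FoundationDB implementation; the indexInPages back-pointer and the helper names are invented for illustration):

    #include <cstdlib>
    #include <utility>
    #include <vector>

    struct Page {
        int indexInPages = -1;                       // back-pointer so removal stays O(1)
        bool evictable() const { return true; }      // e.g. no outstanding reads/writes
    };

    struct RandomEvictionSketch {
        std::vector<Page*> pages;

        // On cache pressure: probe a random slot and evict it if possible.
        void tryEvictOne() {
            if (pages.empty()) return;
            int i = std::rand() % (int)pages.size(); // stands in for the engine's randomInt()
            if (pages[i]->evictable()) {
                // write back / free pages[i] ...
            }
        }

        // On page destruction: swap-and-pop keeps the vector dense in O(1).
        void remove(Page& p) {
            int i = p.indexInPages;
            std::swap(pages[i], pages.back());
            pages[i]->indexInPages = i;
            pages.pop_back();
        }
    };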

@mpilman (Contributor) commented May 1, 2019

Awesome, thanks a lot for the knob-guard.

I think the SQLite cache is turned off (or it is a very small cache), but I am not sure anymore (I would need to hunt this down in the code; @satherton might have a better idea here).

For performance: after looking at the boost stuff I tend to agree that it doesn't seem very likely that this is much more expensive than the current caching strategy. If this patch can improve cache hit rates, it will have a lot of value (for us a higher cache hit rate might save a lot of money as we run on EBS and EBS IOPS are not cheap). @kaomakino has agreed to run some benchmarks on this as soon as he has time.

If you have already done this (but not in this patch): would it be possible to report the cache hit rate? I was planning to do this anyway, as we are very interested in this number, but if you have it already it would be great to include it in this patch (or a separate patch). Basically I would like to have something like N2_cache_hits and N2_cache_misses in ProcessMetrics. Calculating this from the currently reported metrics is a bit awkward...
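(With two such counters the calculation becomes trivial: over any reporting interval, hit rate = Δhits / (Δhits + Δmisses), taking Δ as the change in each cumulative counter between two samples.)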

@ajbeamon (Contributor) commented May 1, 2019

something like N2_cache_hits and N2_cache_misses

The format for these detail names has changed recently; these should probably end up as CacheHits and CacheMisses.

I'll be curious if some of the existing metrics become redundant in this exercise. I don't remember exactly how they work, but I think we have something roughly like counters for how many reads are done in the cache and logically at the filesystem. If it's the case that misses == file reads and hits + misses == cache reads, then the right thing may be to rework all of the metrics surrounding file reads into something more sensible. Otherwise, all of these partially overlapping metrics may end up being even more confusing.

That said, my descriptions above are vague recollections, and if it turns out the old metrics provide some useful information that can't be gleaned from the proposed new set, it may make sense to keep all of them.

@alexmiller-apple (Contributor) commented May 2, 2019

We are typically running go-ycsb skewed and uniform

Do be aware that go-ycsb is only one process, so you'll be bottlenecked on the network thread, and that it does one operation per transaction, so you'll actually just be benchmarking GRV performance rather than read performance.

@mpilman (Contributor) commented May 2, 2019

I'll be curious if some of the existing metrics become redundant in this exercise.

I think there is something, but last time it took me 30 minutes to figure out how to calculate cache miss rate from the ProcessMetrics. This is imho not great 😆 Cache hit rate is imho important enough to be in there. It also makes usage with Wavefront/Grafana (or whatever other tool you are using) much easier.

@nik-io (Contributor, Author) commented May 2, 2019

We are typically running go-ycsb skewed and uniform

Do be aware that go-ycsb is only one process, so you'll be bottlenecked on the network thread, and that it does one operation per transaction, so you'll actually just be benchmarking GRV performance rather than read performance.

Indeed, in our testing we spawn several multithreaded go-ycsb client processes in parallel because we have found (just as you suggest) that this is the only way to maximize read performance. A better way, I guess, would be to modify go-ycsb to use a separate socket per thread (which I assume can be done with a separate fdbopen per thread).

@nik-io (Contributor, Author) commented May 2, 2019

@satherton: I'll safe guard this with a knob and also add CacheHits and CacheMisses stats (as @mpilman and @ajbeamon suggested), and update the pull request as soon as I have it ready.

- Added a knob to control page cache eviction policy (0=random default, 1=lru)
- Added page cache hit, miss, and eviction stats
@ajbeamon (Contributor) commented May 2, 2019

I think there is something, but last time it took me 30 minutes to figure out how to calculate cache miss rate from the ProcessMetrics. This is imho not great 😆 Cache hit rate is imho important enough to be in there. It also makes usage with Wavefront/Grafana (or whatever other tool you are using) much easier.

Agreed, I've wanted more easily interpretable cache statistics too. I just want to avoid having metrics called 'CacheReads' and 'CacheHits', where from the name it may be unclear what the difference is. If CacheReads becomes redundant in this process, I'd be all for getting rid of it.

@ajbeamon (Contributor) commented May 2, 2019

Perhaps an interesting experiment would be to leave them all in and see what the numbers look like. That could provide some empirical evidence whether they are actually just combinations of each other.

@nik-io (Contributor, Author) commented May 2, 2019

@satherton I've updated the pull request:

  • Safe-guarded the cache eviction change with a knob (defaults to the original random policy).
  • Added page cache statistics (hits and misses, but also evictions). I'm not 100% sure that Int64MetricHandle was the way to go here, but that's what I've pushed. If the stats should be implemented some other way, please let me know and I'll amend.

@mpilman (Contributor) commented May 2, 2019

@Nikolas-Ioannou Thanks - this looks really useful. I hope we can soon take a look at this and run our benchmarks with both caches.

@ajbeamon

That could provide some empirical evidence whether they are actually just combinations of each other.

Does this mean you don't quite know what some of these metrics mean?

I think we should document all metrics (StorageMetrics, ProcessMetrics, MachineMetrics, etc.). I will create a ticket for that. This is imho very much necessary...

@ajbeamon (Contributor) commented May 2, 2019

Does this mean you don't quite know what some of these metrics mean?

I think we should document all metrics (StorageMetrics, ProcessMetrics, MachineMetrics, etc.). I will create a ticket for that. This is imho very much necessary...

I have a rough idea what they do, but I haven't carefully evaluated exactly how they relate to each other. It could be that there are some minor overlaps or gaps at the edges that I'm not aware of, so a quick test might be able to help calibrate my expectations.

Documenting metrics sounds valuable; these are probably some of the most important trace events to be able to work with.

@ben-manes

If you can capture a workload trace from the cache's perspective (logging of the hash of the key lookups), then I can run a simulation to show the hit rate across various policies (Random, LRU, ARC, LIRS, etc). It will also give a sense of whether the workload is frequency or recency biased, or mixed.

YCSB and Zipf are very good for load testing with a more realistic distribution, so one can obtain a good estimate of scalability. However, Zipf is too idealistic for judging a caching policy, beyond the basic check of whether it captures some frequency characteristics. A real trace will have other behaviors to account for, like analytical scans (which favor MRU) or a job queue (which favors LRU).

@satherton (Contributor) left a comment:

Looks good, but I really would prefer a string policy identifier, as it seems likely we will have more options in the future (at which point putting the eviction policy behind an interface might be wise too).

flow/Knobs.h Outdated
int64_t SIM_PAGE_CACHE_64K;
int64_t BUGGIFY_SIM_PAGE_CACHE_4K;
int64_t BUGGIFY_SIM_PAGE_CACHE_64K;
int CACHE_EVICTION_POLICY; // 0 RANDOM (default), 1 LRU
Contributor:

We actually do support std::string knobs, and I think cache eviction policy is a good use for one. We could add other policies in the future and strings would be the most user-friendly way to specify them. A member in AsyncFileCached of type CacheEvictionType can then be initialized from the knob's string value.

Contributor Author:

Sounds good! I will change the knob to be a string, add a member (evictionType) and initialize it based on the knob's string value, and update the patch.

Contributor Author:

@satherton updated with the knob being a string.

@nik-io (Contributor, Author) commented May 3, 2019

If you can capture a workload trace from the cache's perspective (logging of the hash of the key lookups), then I can run a simulation to show the hit rate across various policies (Random, LRU, ARC, LIRS, etc). It will also give a sense of whether the workload is frequency or recency biased, or mixed....

I'm afraid we don't have a workload trace at the moment, as we are currently in the adoption stage and generate workloads with YCSB.

@alexmiller-apple (Contributor):

@fdb-build, ok to test

But the CMake macOS builder will fail because it's currently just broken.

@alexmiller-apple (Contributor):

@ben-manes, I appreciate the offer to run simulations against traces, and I've started the conversations internally to see how close to production an environment we would be allowed to produce such traces from.

explicit EvictablePageCache(int pageSize, int64_t maxSize) : pageSize(pageSize), maxPages(maxSize / pageSize) {
    std::string cep = FLOW_KNOBS->CACHE_EVICTION_POLICY;
    std::transform(cep.begin(), cep.end(), cep.begin(), ::tolower);
    if (0 == cep.compare("random"))
Contributor:

An exception should be thrown if the eviction type is not recognized, but also it would be good to error out as soon as possible from fdbserver if the knob is set incorrectly.

To that end, I suggest making a public static helper function to convert a policy string to an enum, throwing an exception on failure (probably a new exception type). That function can then be used here, and we can also call it from fdbserver.cpp using the knob's current value after all of the knob arguments have been processed and set.

Minor style comment - Personally I think cep == "random" is a bit more natural to read than 0 == cep.compare("random").
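To make the suggestion concrete, here is a minimal sketch of such a helper; the names evictionPolicyStringToEnum and CacheEvictionType come from this thread, while std::invalid_argument merely stands in for whatever new FDB error type the real patch throws:

    #include <algorithm>
    #include <stdexcept>
    #include <string>

    enum class CacheEvictionType { RANDOM, LRU };

    // Convert the knob string to an enum; throw on an unrecognized policy so the
    // caller (here, or fdbserver right after knob processing) can fail fast.
    CacheEvictionType evictionPolicyStringToEnum(std::string policy) {
        std::transform(policy.begin(), policy.end(), policy.begin(), ::tolower);
        if (policy == "random") return CacheEvictionType::RANDOM;
        if (policy == "lru") return CacheEvictionType::LRU;
        throw std::invalid_argument("unknown cache eviction policy: " + policy);
    }

fdbserver can then call this on FLOW_KNOBS->CACHE_EVICTION_POLICY once all --knob_ arguments have been applied, which covers the fail-fast part of the request.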

Contributor Author:

@satherton thanks for the input. I have updated the patch with your suggestions.

@ben-manes

@alexmiller-apple It is also fine if you wish to run your traces yourselves, and I can assist if needed. The trace readers are per-format and produce a stream of 64-bit integer keys, each representing an access. Typically you can capture the key's hash to obfuscate it, using either a text or binary log file. Adding a trace is probably less than a dozen lines of code, so it's pretty easy if your team decides to keep everything internal.
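A capture hook of that size could look roughly like the following sketch (logCacheAccess is an invented name, and one integer per line is just one plausible text encoding of the 64-bit key stream described above):

    #include <cstdint>
    #include <cstdio>
    #include <functional>
    #include <string>

    // Log one obfuscated 64-bit key per cache lookup so an external simulator can
    // replay the access stream without seeing the real keys.
    void logCacheAccess(std::FILE* traceFile, const std::string& pageKey) {
        const uint64_t hashed = std::hash<std::string>{}(pageKey);
        std::fprintf(traceFile, "%llu\n", static_cast<unsigned long long>(hashed));
    }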

@satherton (Contributor) left a comment:

These changes look good, but there is one last thing to do: add a call to evictionPolicyStringToEnum() in fdbserver.actor.cpp to validate the knob's current value after knob arguments have been set.

I believe the first opportunity is here: https://github.com/apple/foundationdb/blob/master/fdbserver/fdbserver.actor.cpp#L1428

@nik-io (Contributor, Author) commented May 7, 2019

These changes look good, but there is one last thing to do: add a call to evictionPolicyStringToEnum() in fdbserver.actor.cpp to validate the knob's current value after knob arguments have been set.

I believe the first opportunity is here: https://github.com/apple/foundationdb/blob/master/fdbserver/fdbserver.actor.cpp#L1428

@satherton updated with your requested changes.

@AlvinMooreSr (Contributor):

test this please

@nik-io (Contributor, Author) commented May 8, 2019

test this please

For what it's worth, this is how we tested this using go-ycsb with a page cache size that is ~10% of the dataset:

  1. spawn fdbserver:
    ./bin/fdbserver --datadir /mnt/fdb-nvme/foundationdb/4500 --listen_address public --logdir /mnt/fdb-nvme/foundationdb --public_address auto:4500 --knob_cache_eviction_policy lru --knob_page_cache_4k $((10*1024*1024)) -L .
  2. load go-ycsb with 1M 100B values
    ./bin/go-ycsb load foundationdb -P workloads/workloadRO-small -p threadCount=128 -p measurement.interval=1
  3. stop fdbserver, then start one with random eviction:
    ./bin/fdbserver --datadir /mnt/fdb-nvme/foundationdb/4500 --listen_address public --logdir /mnt/fdb-nvme/foundationdb --public_address auto:4500 --knob_cache_eviction_policy random --knob_page_cache_4k $((10*1024*1024)) -L .
  4. run go-ycsb read test with skewed workload
    ./bin/go-ycsb run foundationdb -P workloads/workloadRO-small -p threadCount=128 -p measurement.interval=1
  5. Repeat steps 3 and 4 with LRU eviction.

Stat output from the fdbserver log, aggregated using the following shell loop:

    for i in Hits Misses Evictions; do cat trace.127.0.0.1.4500.1557322824.Eu8gSz.1.xml.1Mkeys.100B.1Mreads.10MiBcache.skewed.lru_eviction.read-only.out | grep Cache${i} | tr ' ' '\n' | grep Cache${i} | tr -d '"' | awk -F '=' 'BEGIN{sum=0}{sum+=$2}END{print sum}'; done
Random
CacheHits=2543175
CacheMisses=1393667
CacheEvictions=1391158

LRU
CacheHits=2681335
CacheMisses=1255413
CacheEvictions=1252908
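For reference, those counters correspond to a hit rate of roughly 2543175 / (2543175 + 1393667) ≈ 64.6% with random eviction versus 2681335 / (2681335 + 1255413) ≈ 68.1% with LRU, i.e. about 10% fewer misses with LRU on this run.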

Attached below is the go-ycsb output for the two runs:
go-ycsb-output.zip

@ryanworl (Collaborator) commented May 8, 2019

"test this please" is a command to the CI system, @Nikolas-Ioannou.

@satherton satherton merged commit 5a8c974 into apple:master May 17, 2019
@nik-io nik-io deleted the feature-pagecache-lru branch May 20, 2019 14:51
@nik-io nik-io restored the feature-pagecache-lru branch September 2, 2020 07:36