Conversation

@tclinken
Contributor

@tclinken tclinken commented Apr 22, 2019

This is a continuation of PR #1229. It fixes some bugs from that pull request.

@kaomakino
Contributor

[Screenshot: Screen Shot 2019-05-07 at 5.14.01 PM]
These charts show how the local ratekeeper protects us from a heavily skewed read/write workload.
I used a highly skewed workload that generates heavy range reads and writes targeting a single team. Without the local ratekeeper, when a storage server gets too many read requests, it cannot process its pending write tasks, so the NDV won't decrease. With the local ratekeeper, when the NDV rises above a certain threshold, the server starts throttling reads (shown in the two leftmost charts). While reads are throttled, the storage server can process its write tasks, so the NDV comes back down.

In comparison, here are the charts from current master with the same workload.
[Screenshot: Screen Shot 2019-05-07 at 5.56.47 PM]
Note that the NDV does not come down because of the heavy incoming reads.
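
For readers skimming this thread, here is a minimal sketch of the throttling idea being evaluated above. All names and threshold values are illustrative assumptions, not the identifiers or knobs used in the actual diff: the server derives a probability of serving reads from its non-durable version and rejects a fraction of incoming reads when that probability drops below 1.

// Illustrative sketch only: the real implementation lives in this PR's
// storage server changes and takes its thresholds from server knobs.
#include <algorithm>
#include <cstdint>
#include <random>

struct LocalThrottleSketch {
    // Hypothetical thresholds on the non-durable version (NDV).
    static constexpr int64_t startThrottleNDV = 5'000'000;
    static constexpr int64_t maxNDV = 10'000'000;

    std::mt19937 rng{std::random_device{}()};

    // Probability of serving a read: 1.0 while NDV is below the start
    // threshold, falling linearly toward 0.0 as NDV approaches maxNDV.
    double readProbability(int64_t currentVersion, int64_t durableVersion) const {
        int64_t ndv = currentVersion - durableVersion;
        if (ndv <= startThrottleNDV) return 1.0;
        double excess = double(ndv - startThrottleNDV) / double(maxNDV - startThrottleNDV);
        return std::max(0.0, 1.0 - excess);
    }

    // Decide whether to serve or reject (throttle) an incoming read request.
    bool shouldServeRead(int64_t currentVersion, int64_t durableVersion) {
        std::uniform_real_distribution<double> dist(0.0, 1.0);
        return dist(rng) < readProbability(currentVersion, durableVersion);
    }
};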

@xumengpanda
Contributor

This is an insightful evaluation.

While trying to understand the figures, I'm confused about what each line means in the ReadOPS, WriteOPS, and NDV charts. Maybe you posted the legend for those figures earlier, but I couldn't find it.

@kaomakino
Contributor

ReadOPS = Read operations / second
WriteOPS = Write operations / second
NDV = Non-Durable Version (= Current Version - Durable Version)
These are time-series charts; the X-axis is time in seconds.

@xumengpanda
Contributor

I think I didn't explain myself clearly; I understood those abbreviations.

What I was asking is what the lines in different colors mean in each of those figures.
I'm guessing each line in the figures is the data for one client?

@kaomakino
Contributor

Ah, got it. Each line represents one storage process. In the NDV chart, there are three lines moving because one team (triple redundancy) is hot.

@kaomakino
Contributor

In addition to those per-server-process charts, the "Local Ratekeeper" chart shows the aggregate "probability of serving reads" across all storage processes. The "Ratekeeper" and "Worst Storage Queue" charts are taken from the ratekeeper metrics.

@etschannen
Contributor

My impression of the local ratekeeper design was that ratekeeper would only allow two storage servers (with triple replication) to start limiting their reads at any one time.

This implementation lets all storage servers limit themselves, and additionally will cause the main ratekeeper to throttle if they fall too far behind. With this implementation I am concerned that a saturating workload could cause all storage servers to each decide to throttle reads. This would lead to higher read latencies, which could cause transactions to start taking longer than 5 seconds, and lead to a death spiral.


double getPenalty() {
    return std::max(1.0, (queueSize() - (SERVER_KNOBS->TARGET_BYTES_PER_STORAGE_SERVER - 2.0*SERVER_KNOBS->SPRING_BYTES_STORAGE_SERVER)) / SERVER_KNOBS->SPRING_BYTES_STORAGE_SERVER);
    return std::max(std::min(1.0, (queueSize() - (SERVER_KNOBS->TARGET_BYTES_PER_STORAGE_SERVER -

The penalty should always be larger than 1. Load balancing works by trying to keep an equal number of requests outstanding to all servers; a penalty makes a single outstanding request act like more than one request, so you send less traffic to that server. I think the formula you want for the penalty would be 1.0/(1.0-currentRate()), and everything should be max instead of having a min.
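
For illustration, a minimal sketch of the shape of that suggestion, written against the surrounding storage-server context from the diff (queueSize(), currentRate(), SERVER_KNOBS). It assumes currentRate() stays in [0, 1) and grows as throttling increases; the division-by-zero guard is also an added assumption, so treat this as a sketch of the formula rather than the code that was merged:

double getPenalty() {
    // Queue-based component from the existing formula: always at least 1.0.
    double queuePenalty =
        std::max(1.0, (queueSize() - (SERVER_KNOBS->TARGET_BYTES_PER_STORAGE_SERVER -
                                      2.0 * SERVER_KNOBS->SPRING_BYTES_STORAGE_SERVER)) /
                          SERVER_KNOBS->SPRING_BYTES_STORAGE_SERVER);
    // Rate-based component per the suggestion: 1.0/(1.0 - currentRate()),
    // clamped so a rate of exactly 1.0 cannot divide by zero (assumption).
    double ratePenalty = 1.0 / std::max(1e-6, 1.0 - currentRate());
    // Take the max of the components, so the penalty is never below 1.0 and
    // is not capped by a min as it grows.
    return std::max(queuePenalty, ratePenalty);
}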

@tclinken tclinken force-pushed the features/local-rk branch from 6f0af55 to 8dbb231 on June 5, 2019 22:43
}

ACTOR Future<Void> updateStorage(StorageServer* data) {
    state std::string waitDescription = format("%s/updateStorage", data->thisServerID.toString().c_str());

Pass this string directly into checkDisabled so we do not construct it when not in simulation.
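
A minimal sketch of one way to read that suggestion; the checkDisabled name comes from the review comment above, but its signature and the g_network->isSimulated() guard here are assumptions made purely for illustration:

ACTOR Future<Void> updateStorage(StorageServer* data) {
    // Build the description only where it can actually be used, instead of
    // holding it in a state variable for the whole actor regardless of mode.
    if (g_network->isSimulated()) {
        checkDisabled(format("%s/updateStorage", data->thisServerID.toString().c_str()));
    }
    // ... rest of the actor unchanged ...
}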

@etschannen etschannen merged commit 9fdbf0c into apple:master Jun 11, 2019