Reduce recovery times caused by saturating the cluster controller #2430
Conversation
…rs to avoid having a saturated cluster controller recruit a master without all available workers
…ruitment over sending serverDBInfo
…luster controller if possible
…e part of the new generation
```cpp
if(req.maxOldLogRouters > 0) {
	auto oldLogRouters = tlogs;
	for(int i = 0; i < oldLogRouters.size(); i++) {
		result.oldLogRouters.push_back(oldLogRouters[i].interf);
```
This seems to be assuming that req.maxOldLogRouters is less than or equal to tlogs.size() + 1. Either that, or we're just ignoring maxOldLogRouters.
Can we make that an ASSERT if that's the case?
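For concreteness, a minimal sketch of the suggested check, assuming the intended invariant really is that the request never asks for more old log routers than tlogs.size() + 1 (this is an assumption about the intent, not part of the actual change):

```cpp
// Sketch only: documents the assumed invariant discussed above.
ASSERT(req.maxOldLogRouters <= (int)tlogs.size() + 1);
```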
The code is already written so that it can recruit multiple log routers on the same worker, so maxOldLogRouters was meant as a suggested maximum number of workers; returning fewer is okay.
My concern was the opposite -- namely what if maxOldLogRouters is smaller than the number of logs. If such a thing were possible, this would result in there being more old log routers than the max.
Looking at the code, the list of workers does not determine the number of log routers, so in that case it would just ignore the additional interfaces.
```cpp
	}
	if(foundCC) {
		result.oldLogRouters.push_back(oldLogRouters.back().interf);
	}
```
So this logic means that we will always have 1 fewer old log router than tlogs (except when there's only 1)? Is there any reason not to use all of the tlogs when none of them are the cluster controller?
If what's written is the intent, the else block could be simplified to something like:
```cpp
for(int i = 0; i < oldLogRouters.size() && result.oldLogRouters.size() < oldLogRouters.size() - 1; ++i) {
	if(oldLogRouters[i].interf.locality.processId() != clusterControllerProcessId) {
		result.oldLogRouters.push_back(oldLogRouters[i].interf);
	}
}
```
```cpp
for(auto& it : rep.oldLogRouters) {
	self->db.requiredAddresses.insert(it.address());
	if( it.tLog.getEndpoint().addresses.secondaryAddress.present() ) self->db.requiredAddresses.insert(it.tLog.getEndpoint().addresses.secondaryAddress.get());
}
```
Since this loop is written a bunch of times, it might be a good candidate for extracting into a function.
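One possible shape for such a helper (a sketch only; the name addRequiredAddresses and the endpoint-accessor parameter are hypothetical, not part of the existing code):

```cpp
// Hypothetical helper sketched from the review comment above. It records the
// primary address and, when present, the secondary address of each recruited
// interface in self->db.requiredAddresses.
template <class Container, class GetEndpoint>
void addRequiredAddresses(ClusterControllerData* self, const Container& interfaces, GetEndpoint getEndpoint) {
	for (const auto& it : interfaces) {
		self->db.requiredAddresses.insert(it.address());
		auto addresses = getEndpoint(it).addresses;
		if (addresses.secondaryAddress.present()) {
			self->db.requiredAddresses.insert(addresses.secondaryAddress.get());
		}
	}
}

// Usage for the old log router loop shown above would then be something like:
// addRequiredAddresses(self, rep.oldLogRouters, [](const auto& i) { return i.tLog.getEndpoint(); });
```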
code cleanup